Abstract



We consider the use of Frank-Wolfe optimization algorithms on the dual formulation of
structural SVMs. These yield simple algorithms which only need access to an approximate
maximization oracle for the structured prediction problem and thus have wide applicability.
This perspective provides insights on previous popular algorithms: we show that the batch
subgradient and the cutting plane algorithms are equivalent to versions of Frank-Wolfe
algorithms, which enables us to improve on their convergence analysis by drawing on the
Frank-Wolfe literature. Moreover, we propose a new stochastic coordinate descent version of
Frank-Wolfe which yields a provably convergent optimization algorithm for structural SVMs with
total run-time independent of the number of training examples, like Pegasos, but with duality
gap certificate guarantees and step-size robustness thanks to the use of line-search. Our
experiments on sequence prediction indicate that this simple algorithm outperforms all other
optimization algorithms which only have access to the maximization oracle.



1. Introduction 

The popularity of binary SVMs as a general classification tool has spurred interest in recent
years in tailored convex optimization solvers for their large margin learning objective. The
choice has been more limited, however, for the extension of SVMs to structured outputs (such as
graphs and other combinatorial objects (Taskar et al., 2003; Tsochantaridis et al., 2005)), probably
due to the additional complexity of dealing with an exponential number of variables/constraints,
and the diversity of ways to exploit their structure. On the other hand, one of the oldest (and
simplest) constrained optimization algorithms, the Frank-Wolfe algorithm (Frank and Wolfe,
1956) (also called conditional gradient (Bertsekas, 1999)), has seen a surge of interest recently,
both in machine learning and signal processing in general (Mangasarian, 1995; Clarkson, 2010;
Jaggi, 2011; Bach, 2011), and for binary SVMs in particular (Keerthi et al., 2000; Gärtner and Jaggi,
2009; Ouyang and Gray, 2010). This interest stems from useful properties: the algorithm only requires
the efficient optimization of linear functions over (possibly large) constraint sets, and it yields
sparse iterates. These properties make it a natural candidate optimizer for the differentiable dual
objective of the structural SVM formulation.

In this paper, we thus consider the use of Frank-Wolfe optimization algorithms for structural
SVMs (Taskar et al., 2003; Tsochantaridis et al., 2005). Despite the exponential number of
variables to consider, the linear optimization over the constraint set turns out to be equivalent
to the so-called loss-augmented decoding subproblem (hereafter called a maximization oracle),
which can often be solved efficiently and is used by other popular structural SVM optimization
algorithms such as the subgradient algorithms (Ratliff et al., 2007; Shalev-Shwartz et al.,
2010) or the cutting plane approaches (Joachims et al., 2009; Teo et al., 2010). We
focus on optimizers which only require such maximization oracles, as these give the
widest applicability for structured prediction. Such oracles exist (sometimes in approximate
form) for a wide range of structures, such as the dynamic programming structure coming from
a graphical model formulation on the labels (Taskar et al., 2003), as well as other combinatorial
objects such as graph matchings (Caetano et al., 2009) or associative Markov networks (Taskar,
2004). In contrast, other approaches make use of more expensive oracles, such as marginal
inference on a graphical model defined on the labels (called the expectation oracle in the summary
Table 1), or a Bregman projection on the space of structures (Taskar et al., 2006); these
operations are generally less efficient than maximization oracles. We can also distinguish
between batch algorithms, which update the parameters only after processing all the training
data, and online algorithms, which update after every datapoint, such as stochastic subgradient
methods (Ratliff et al., 2007; Shalev-Shwartz et al., 2010) or the online exponentiated gradient
approach (Collins et al., 2008). We summarize a few of the most popular algorithms in Table 1,
with their convergence rates quoted in number of oracle calls to reach an accuracy of $\varepsilon$, in terms
of the relevant quantities defined in Section 2.

By considering the Frank-Wolfe perspective, we make the following contributions in this paper:

• We show that the batch Frank-Wolfe algorithm on the structural SVM dual objective is
equivalent to batch subgradient descent in the primal. This suggests a new line-search version of
the subgradient method and improves its convergence rate over the analysis provided
in Shalev-Shwartz et al. (2010). We also show that the min-norm-point extension of
Frank-Wolfe, which in each iteration re-optimizes over all previously visited coordinates (Clarkson,
2010), is equivalent to the cutting plane algorithm of SVM-Struct (Joachims et al., 2009).
Because of recent advances giving duality gap convergence rates for Frank-Wolfe algorithms
with $\varepsilon$-approximate oracles, we obtain new primal rate guarantees for the batch subgradient
and cutting plane algorithms, even with approximate oracles.

• We propose a new stochastic coordinate descent version of Frank-Wolfe on product domains
with similar provable convergence rates. By applying it to structural SVMs, we obtain
an online algorithm for which the number of oracle calls required to reach a precision $\varepsilon$ is
independent of the number of training examples (see Table 1), which only requires $\varepsilon$-approximate
maximization oracles, and which can provide a duality gap certificate.

• Our experimental results on sequence prediction closely match the theoretical analysis: the
line search yields a significant advantage in the first few passes compared to the stochastic
subgradient approach of Pegasos (Shalev-Shwartz et al., 2010), and there is a systematic
(but smaller) advantage in the later passes due to the difference of a logarithmic factor in
the rates.

2. Problem Setup: Large Margin Structured Prediction 

[...] | primal error w.h.p. | maximization
this paper: stochastic coordinate descent Frank-Wolfe | primal-dual | expected duality gap | maximization

Table 1: Convergence rates given in the number of calls to the oracles for different optimization algorithms
for the structural SVM objective (1) in the case of a Markov random field structure, to reach
a specific accuracy $\varepsilon$ measured for different types of gaps, in terms of the number of training
examples $n$, regularization parameter $\lambda$, size of the label space $|\mathcal{Y}|$, and maximum feature norm
$R := \max_{i,y} \|\psi_i(y)\|_2$ (some minor terms were ignored for succinctness). The $\tilde{O}$ notation ignores
the logarithmic terms; this appears in the case of Pegasos, as its bound on the error decreases
as $\log(k)/k$, where $k$ is the number of iterations. Notice that only Pegasos and our proposed
algorithm have rates independent of $n$.

We briefly review the standard convex optimization setup for large margin learning for structured
prediction (Taskar et al., 2003; Tsochantaridis et al., 2005). In structured prediction, the goal is
to predict a structured object $y \in \mathcal{Y}(x)$ (such as a sequence of tags) for a given input $x \in \mathcal{X}$.
In the standard approach, a structured feature map $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ encoding the relevant
information for input/output pairs is defined in such a way that a linear classifier $h_w$ with
parameter $w$, taking the form $h_w(x) = \operatorname{argmax}_{y \in \mathcal{Y}(x)} \langle w, \phi(x, y) \rangle$, can be computed efficiently
(for example using dynamic programming). Given a labelled training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, $w$
is learned by solving the following optimization problem, which encodes the large margin criterion
for structural SVM (Tsochantaridis et al., 2005; Taskar et al., 2003):




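$$\min_{w,\,\xi}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \langle w, \underbrace{\phi(x_i, y_i) - \phi(x_i, y)}_{=:\,\psi_i(y)} \rangle \ \ge\ \underbrace{L(y_i, y)}_{=:\,L_i(y)} - \xi_i \qquad \forall i,\ \forall y \in \underbrace{\mathcal{Y}(x_i)}_{=:\,\mathcal{Y}_i}. \tag{1}$$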



$L_i(y) := L(y_i, y)$ denotes the task-dependent structured error of predicting output $y$ instead of
the observed output $y_i$ (a Hamming distance between the two labels, for example), $\xi_i$ is a slack
variable measuring the surrogate loss for the $i$-th data point (how well the margin is satisfied),
and $\lambda$ is the regularization parameter. The convex problem in (1) is what Joachims et al. (2009,
Optimization Problem 2) call the n-slack structural SVM with margin-rescaling. Besides the
margin-rescaled formulation above, Tsochantaridis et al. (2005) also proposed a slack-rescaled
variant, which can be obtained by replacing $\psi_i(y)$ in (1) by $\psi_i^{s}(y) := L_i(y)\,\psi_i(y)$.

Non-Smooth Formulation and Loss-Augmented Decoding. For structured prediction, the above
problem can have an exponential number of constraints due to the combinatorial nature of $\mathcal{Y}$.
We can replace the $\sum_i |\mathcal{Y}_i|$ linear constraints with $n$ non-linear ones by defining the structured
hinge-loss:

$$H_i(w) := \max_{y \in \mathcal{Y}_i}\ \underbrace{L_i(y) - \langle w, \psi_i(y) \rangle}_{=:\,H_i(y;\,w)}. \tag{2}$$

The constraints in (1) can thus be replaced with the non-linear ones $\xi_i \ge H_i(w)$. The computation
of the structured hinge-loss for each $i$ amounts to finding the most "violating" output $y$ for a
given input $x_i$, a task which can be carried out efficiently in typical structured prediction settings
(under suitable assumptions of decomposability of the feature map $\phi$ and the error function $L$).
This problem is called the loss-augmented decoding subproblem. In this paper, we only assume
access to an efficient solver for this subproblem (the maximization oracle). Because the sum of
the $\xi_i$ is minimized in (1), the constraint $\xi_i \ge H_i(w)$ is tight at the optimum, and so an equivalent
non-smooth unconstrained formulation of (1) is:

$$\min_{w}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n H_i(w). \tag{3}$$

An obvious algorithm to optimize this formulation is subgradient descent, as in Ratliff
et al. (2007). A subgradient of $H_i(w)$ with respect to $w$ is easily found as $-\psi_i(y^*)$ for $y^*$ being
a maximizer in the loss-augmented decoding subproblem (2). In the slack-rescaled variant of
the structural SVM, the loss-augmented subproblem (2) with $\psi_i^{s}(y)$ in place of $\psi_i(y)$ generally
becomes more difficult to solve than in the margin-rescaled variant. For this reason we focus here
on the margin-rescaled version, although all of the analysis also applies to the slack-rescaled case.
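As a minimal sketch of the maximization oracle and the resulting subgradient, the following Python code enumerates a small label set by brute force. The multiclass block feature map, the 0/1 loss and all function names (`feature_map`, `psi`, `loss_augmented_decode`, `hinge_subgradient`) are illustrative assumptions, not part of the paper; realistic structured problems would replace the enumeration by a problem-specific decoder such as dynamic programming.

```python
def feature_map(x, y, num_labels):
    """Assumed toy joint feature map phi(x, y): copy x into the block of label y."""
    d = len(x)
    phi = [0.0] * (d * num_labels)
    phi[y * d:(y + 1) * d] = x
    return phi


def psi(x, y_true, y, num_labels):
    """Difference feature vector psi_i(y) = phi(x_i, y_i) - phi(x_i, y)."""
    a = feature_map(x, y_true, num_labels)
    b = feature_map(x, y, num_labels)
    return [ai - bi for ai, bi in zip(a, b)]


def zero_one_loss(y_true, y):
    """Assumed task loss L_i(y): 0 if the label is correct, 1 otherwise."""
    return 0.0 if y == y_true else 1.0


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def loss_augmented_decode(w, x, y_true, num_labels):
    """Brute-force maximization oracle for (2): returns (y_star, H_i(w))."""
    best_y, best_val = None, float("-inf")
    for y in range(num_labels):
        val = zero_one_loss(y_true, y) - dot(w, psi(x, y_true, y, num_labels))
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val


def hinge_subgradient(w, x, y_true, num_labels):
    """A subgradient of H_i at w, namely -psi_i(y_star)."""
    y_star, _ = loss_augmented_decode(w, x, y_true, num_labels)
    return [-p for p in psi(x, y_true, y_star, num_labels)]
```

At $w = 0$ every wrong label attains the value 1, so the oracle returns a wrong label; once $w$ separates the correct label by a sufficient margin, the maximizer is the correct label itself and the hinge-loss is zero.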



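The Dual of the n-Slack Formulation. The Lagrange dual problem of the above n-slack formulation
(1) has $m := \sum_i |\mathcal{Y}_i|$ variables or potential "support vectors". It is given by:

$$\min_{\alpha \in \mathbb{R}^m}\ f(\alpha) := \frac{\lambda}{2}\,\Big\| \underbrace{\sum_{i \in [n],\, y \in \mathcal{Y}_i} \alpha_i(y)\, \frac{\psi_i(y)}{\lambda n}}_{=:\ w\, =\, A\alpha} \Big\|^2 \ -\ \underbrace{\sum_{i \in [n],\, y \in \mathcal{Y}_i} \alpha_i(y)\, \frac{L_i(y)}{n}}_{=:\ b^T \alpha} \tag{4}$$

$$\text{s.t.} \quad \sum_{y \in \mathcal{Y}_i} \alpha_i(y) = 1 \ \ \forall i \in [n], \quad \text{and} \quad \alpha_i(y) \ge 0 \ \ \forall i \in [n],\ \forall y \in \mathcal{Y}_i.$$

We denote by $\alpha_i(y)$ the dual variable associated to the training example $i$ and potential output
$y \in \mathcal{Y}_i$. In this work, for some given dual variable vector $\alpha$, we will often consider the
corresponding primal variable $w = \sum_{i,\, y \in \mathcal{Y}_i} \alpha_i(y)\, \frac{\psi_i(y)}{\lambda n}$ obtained from the KKT condition which needs
to hold at optimality of the two above convex optimization problems; see Appendix B.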

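To simplify notation, we introduce the matrix $A \in \mathbb{R}^{d \times m}$ consisting of the $m$ columns
$A := \{ \frac{\psi_i(y)}{\lambda n} \in \mathbb{R}^d \mid i \in [n],\ y \in \mathcal{Y}_i \}$. Using this, the primal-dual correspondence between $w$
and $\alpha$ is simply $w = A\alpha$. Also, our dual optimization objective (4) simplifies to
$f(\alpha) := \frac{\lambda}{2}\|A\alpha\|^2 - b^T\alpha$ for the fixed vector $b \in \mathbb{R}^m$ given by $b := \big(\frac{1}{n} L_i(y)\big)_{i \in [n],\, y \in \mathcal{Y}_i}$. Here the domain
$\mathcal{M} \subset \mathbb{R}^m$ is the product of $n$ simplices, $\mathcal{M} := \Delta_{|\mathcal{Y}_1|} \times \cdots \times \Delta_{|\mathcal{Y}_n|}$.

Since all methods considered in this paper are first-order optimization algorithms, the gradient
$\nabla f(\alpha) = \lambda A^T A \alpha - b = \lambda A^T w - b$ of our (dual) objective function $f(\alpha)$ will be a crucial quantity.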



3. Frank-Wolfe Algorithms for Constrained Convex Optimization 

In this section, we review the Frank-Wolfe algorithm and present a new general coordinate-wise
version which is of interest for general constrained optimization over product domains. We
describe how to apply Frank-Wolfe on the dual of the structural SVM in Section 4 and the
coordinate-wise version in Section 5.



The Frank-Wolfe Algorithm. We consider the convex optimization problem $\min_{\alpha \in \mathcal{M}} f(\alpha)$,
where the convex feasible set $\mathcal{M}$ is compact and the convex objective $f$ is continuously
differentiable. The Frank-Wolfe algorithm (Frank and Wolfe, 1956) (listed in Algorithm 1) is an iterative
optimization algorithm for such problems that only requires the ability to optimize linear
functions over $\mathcal{M}$, and thus has wide applicability. At every iteration, a feasible search corner $s$ is first
found by minimizing over $\mathcal{M}$ the linearization of $f$ at the current iterate $\alpha$ (see picture in inset).
Algorithm 1: Frank-Wolfe on a Compact Convex Domain

    Let $\alpha^{(0)} \in \mathcal{M}$
    for $k = 0 \ldots K$ do
        Let $\gamma := \frac{2}{k+2}$
        Compute $s := \operatorname{argmin}_{s' \in \mathcal{M}} \langle s', \nabla f(\alpha^{(k)}) \rangle$
        Optionally: Choose $\gamma$ by line-search
        Update $\alpha^{(k+1)} := (1 - \gamma)\,\alpha^{(k)} + \gamma s$
    end



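Algorithm 2: Coordinate Descent with "Frank-Wolfe"-Type Updates on Product Domain

    Let $\alpha^{(0)} \in \mathcal{M} = \mathcal{M}^{(1)} \times \ldots \times \mathcal{M}^{(n)}$
    for $k = 0 \ldots K$ do
        Pick $i \in_{u.a.r.} [n]$
        Let $\gamma := \ldots$ (predefined step-size)
        Find $s_{(i)} := \operatorname{argmin}_{s'_{(i)} \in \mathcal{M}^{(i)}} \langle s'_{(i)}, \nabla_{(i)} f(\alpha^{(k)}) \rangle$
        Optionally: Choose $\gamma$ by line-search
        Update $\alpha_{(i)}^{(k+1)} := \alpha_{(i)}^{(k)} + \gamma\,(s_{(i)} - \alpha_{(i)}^{(k)})$
    end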




The next iterate is then obtained as a convex combination of $s$ and the previous iterate, with
step-size $\gamma$. These simple updates yield two additional interesting properties for this
algorithm. First, every iterate $\alpha^{(k)}$ can be written as a convex combination of the starting
point $\alpha^{(0)}$ and the search corners $s$ found previously; $\alpha^{(k)}$ thus has a sparse representation,
which makes the algorithm suitable even for cases where the dimensionality of $\alpha$ is huge.
Second, since $f$ is convex, the minimum of the linearization of $f$ over $\mathcal{M}$ immediately gives
a lower bound on the value of the yet unknown optimal solution $f(\alpha^*)$. Every step of the
algorithm thus computes for free the following "linearization duality gap", defined for any
feasible point $\alpha \in \mathcal{M}$ (which is in fact a special case of the Fenchel duality gap as explained in
Appendix A):

$$g(\alpha) := \max_{s' \in \mathcal{M}} \langle \alpha - s', \nabla f(\alpha) \rangle = \langle \alpha - s, \nabla f(\alpha) \rangle. \tag{5}$$

As $g(\alpha) \ge f(\alpha) - f(\alpha^*)$ by the above argument, $s$ thus readily gives at each iteration the current
duality gap as a certificate for the current approximation quality,¹ allowing us to monitor the
convergence, and more importantly to choose the theoretically sound stopping criterion
$g(\alpha^{(k)}) \le \varepsilon$, instead of specifying a maximum number of iterations $K$.

In terms of convergence, it is known that after $O(1/\varepsilon)$ iterations, Algorithm 1 obtains an
$\varepsilon$-approximate solution (Frank and Wolfe, 1956; Dunn and Harshbarger, 1978), and a guaranteed
$\varepsilon$-small duality gap (Clarkson, 2010; Jaggi, 2011), along with a certificate $s$. For the convergence
results to hold, the internal linear subproblem need not be solved exactly, but only
up to some additive error, as we briefly discuss in Section 6. We generalize and review the
convergence proof in Appendix E. The constant hidden in the $O(1/\varepsilon)$ notation is the curvature
constant $C_f$ (alternatively also called the strong smoothness constant of $f$), which is essentially the
Lipschitz constant of the gradient $\nabla f$ times the squared diameter of $\mathcal{M}$; see our Appendix
C for a formal definition.
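The following Python sketch illustrates Algorithm 1 with the duality gap (5) used as stopping criterion, on a single probability simplex. The function name `frank_wolfe_simplex`, the toy objective $f(\alpha) = \frac{1}{2}\|\alpha - c\|^2$ and the tolerance are assumptions chosen only for illustration; on a simplex the linear subproblem reduces to picking the coordinate with the smallest gradient entry.

```python
def frank_wolfe_simplex(grad, alpha0, eps=1e-2, max_iter=100000):
    """Frank-Wolfe with the predefined step-size 2/(k+2) on the probability simplex.

    grad   : function returning the gradient of f at alpha (as a list)
    alpha0 : feasible starting point on the simplex
    Returns (alpha, gap, k), stopping as soon as the duality gap is <= eps.
    """
    alpha = list(alpha0)
    gap = float("inf")
    for k in range(max_iter):
        g = grad(alpha)
        # Linear subproblem over the simplex: the minimizer is the corner e_j
        # whose coordinate j has the smallest gradient entry.
        j = min(range(len(alpha)), key=lambda t: g[t])
        # Duality gap (5): <alpha - s, grad f(alpha)> with s = e_j.
        gap = sum(gi * ai for gi, ai in zip(g, alpha)) - g[j]
        if gap <= eps:
            return alpha, gap, k
        gamma = 2.0 / (k + 2)
        # Convex combination (1 - gamma) * alpha + gamma * e_j.
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[j] += gamma
    return alpha, gap, max_iter


# Toy usage: f(alpha) = 0.5 * ||alpha - c||^2 with c inside the simplex.
c = [0.2, 0.3, 0.5]
alpha, gap, iters = frank_wolfe_simplex(
    lambda a: [ai - ci for ai, ci in zip(a, c)], [1.0, 0.0, 0.0]
)
```

Because the gap upper-bounds $f(\alpha) - f(\alpha^*)$, the returned iterate is certified to be `eps`-accurate without knowing the optimum.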



Block-Coordinate Frank-Wolfe Algorithm for Product Domains. Algorithm 2 represents the
main new optimization contribution of this paper, being a provably convergent block-coordinate
version of the Frank-Wolfe algorithm for constrained convex optimization problems

$$\min_{\alpha \in \mathcal{M}^{(1)} \times \ldots \times \mathcal{M}^{(n)}} f(\alpha), \tag{6}$$

where the domain has the structure of a Cartesian product $\mathcal{M} = \mathcal{M}^{(1)} \times \ldots \times \mathcal{M}^{(n)} \subseteq \mathbb{R}^m$
over $n$ factors, or blocks. The main idea of the method is to perform cheaper update
steps that only affect a single variable block, and not all of them simultaneously. This is
motivated by coordinate descent methods, which have a very successful history when applied to
large scale optimization. Here we assume that each factor $\mathcal{M}^{(i)} \subseteq \mathbb{R}^{m_i}$ is convex and compact,
and $\sum_{i=1}^n m_i = m$. We write $\alpha_{(i)} \in \mathbb{R}^{m_i}$ for the $i$-th block of coordinates of a vector $\alpha \in \mathbb{R}^m$.

¹ See also Jaggi (2011).

Algorithm 2 picks a single one of the $n$ factors uniformly at random, and in each step leaves all
other factors unchanged. If there is only one factor ($n = 1$), then Algorithm 2 becomes the
standard Frank-Wolfe Algorithm 1. The algorithm can be interpreted as a simplification of Nesterov's
"huge-scale" uniform coordinate descent method (Nesterov, 2010, Section 4). Here, instead of
solving a more complicated proximal operator, we only need to solve a linear subproblem in each
iteration.

Convergence Results. The following main theorem shows that after $O(1/\varepsilon)$ iterations,
Algorithm 2 obtains an $\varepsilon$-approximate solution, and a guaranteed $\varepsilon$-small duality gap. Here the
constant $C_f^{\otimes} := \sum_{i=1}^n C_f^{(i)}$ is the sum of the (partial) curvature constants of $f$ with respect
to the individual domain factors $\mathcal{M}^{(i)}$. We discuss this Lipschitz assumption on the gradient in
more detail in Appendix C, and compute this constant precisely for the structural SVM in
Section 5. In the following convergence results, $h_0 := f(\alpha^{(0)}) - f(\alpha^*)$ is the initial error at the
starting point of the algorithm, for $\alpha^* \in \mathcal{M}$ being an optimal solution. Proofs are provided in
Appendix E (cf. Theorems 7 and 9).

Theorem 1. For each $k \ge 0$, the iterate $\alpha^{(k)}$ of Algorithm 2 (either using the predefined
step-sizes, or using line-search) satisfies $\mathbb{E}[f(\alpha^{(k)})] - f(\alpha^*) \le \ldots\,(2C_f^{\otimes} + h_0)$, where $\alpha^* \in \mathcal{M}$ is
an optimal solution to problem (6), and the expectation is over the random choice of the factor $i$
in the steps of the algorithm.

Furthermore, if Algorithm 2 is run for $K \ge 2$ iterations, then it has an iterate $\alpha^{(\hat{k})}$, $0 \le \hat{k} \le K$,
with expected duality gap bounded by $\mathbb{E}[g(\alpha^{(\hat{k})})] \le \ldots\,(2C_f^{\otimes} + h_0)$.

4. The Frank-Wolfe Algorithm for Structural SVMs 

In this section, we explain how the Frank-Wolfe Algorithm 1 can be efficiently applied to solve
the dual problem (4) of the structural SVM and show its relationship to other algorithms. Recall
that the optimization domain for the dual variables $\alpha$ is the product of $n$ simplices,
$\mathcal{M} = \Delta_{|\mathcal{Y}_1|} \times \ldots \times \Delta_{|\mathcal{Y}_n|}$. Since each simplex consists of a potentially exponential number $|\mathcal{Y}_i|$ of dual
variables, we cannot maintain a dense vector $\alpha$ during the algorithm. Note though that each
"corner" $s$ of this domain corresponds to putting all the mass in each simplex $i$ on a single
labeling $\tilde{y}_i$, i.e. $s := (e^{\tilde{y}_1}, \ldots, e^{\tilde{y}_n}) \in \mathcal{M}$, where $e^{\tilde{y}_i}$ is a vector of zeros everywhere except with a
one at the coordinate $\alpha_i(\tilde{y}_i)$. The main insight which enables us to apply Frank-Wolfe here is
that the linear subproblem used in Frank-Wolfe, which reduces to a search over all corners
of the domain, is actually equivalent to solving the loss-augmented decoding subproblem on each
datapoint, and thus can be done efficiently (see Appendix D.1 for details). Moreover, as mentioned
in Section 3, each iterate $\alpha^{(k)}$ of the Frank-Wolfe algorithm is a sparse convex combination of the
previously visited corners $s$ and the starting point $\alpha^{(0)}$, and so we only need to maintain
the list of previously seen solutions to the loss-augmented decoding subproblems to keep track
of the non-zero coordinates of $\alpha$, avoiding the problem of its exponential size. Finally, in the
case that we do not use kernels, we avoid the quadratic explosion of the number of operations
needed in the dual by not explicitly maintaining $\alpha^{(k)}$ but rather only maintaining explicitly
the corresponding primal variable vector $w^{(k)} := A\alpha^{(k)}$. The resulting Algorithms 3 and 4 are
equivalent to the original optimization Algorithms 1 and 2, but the iterates are only represented
in the primal.
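The equivalence between the linear subproblem and loss-augmented decoding can be checked directly from the definitions of $A$, $b$ and $H_i(y; w)$ in Section 2. The short derivation below is only a sketch of that argument (the detailed version is the one referred to in Appendix D.1); it uses $w = A\alpha$ and the fact that a linear function over a product of simplices is minimized block by block at a corner.

```latex
\begin{align*}
\big(\nabla f(\alpha)\big)_{(i,y)}
  &= \lambda \Big\langle \tfrac{1}{\lambda n}\,\psi_i(y),\, w \Big\rangle - \tfrac{1}{n}\, L_i(y)
   = -\tfrac{1}{n}\,\big( L_i(y) - \langle w, \psi_i(y) \rangle \big)
   = -\tfrac{1}{n}\, H_i(y; w), \\
\min_{s' \in \mathcal{M}} \langle s', \nabla f(\alpha) \rangle
  &= \sum_{i=1}^{n} \min_{y \in \mathcal{Y}_i} \big(\nabla f(\alpha)\big)_{(i,y)}
   = -\tfrac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i} H_i(y; w).
\end{align*}
```

The minimizing corner therefore places, in each block $i$, all the mass on a maximizer of $H_i(y; w)$, which is exactly the output of the maximization oracle for datapoint $i$.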






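Algorithm 3: Batch Primal-Dual Frank-Wolfe Algorithm for the Structural SVM

    Let $w^{(0)} := 0$, $\ell^{(0)} := 0$
    for $k = 0 \ldots K$ do
        Let $\gamma := \frac{2}{k+2}$
        for $i = 1 \ldots n$ do
            Solve $y_i^* := \operatorname{argmax}_{y \in \mathcal{Y}_i} H_i(y; w^{(k)})$
        end
        Let $w_s := \sum_{i=1}^n \frac{1}{\lambda n}\,\psi_i(y_i^*)$ and $\ell_s := \frac{1}{n} \sum_{i=1}^n L_i(y_i^*)$
        Optionally: Choose $\gamma$ by line-search
        Update $w^{(k+1)} := (1 - \gamma)\,w^{(k)} + \gamma\,w_s$
            and $\ell^{(k+1)} := (1 - \gamma)\,\ell^{(k)} + \gamma\,\ell_s$
    end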



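Algorithm 4: Coordinate Descent Frank-Wolfe Algorithm for the Structural SVM

    Let $w^{(0)} := w_i^{(0)} := 0$, $\ell^{(0)} := \ell_i^{(0)} := 0$
    for $k = 0 \ldots K$ do
        Pick $i \in_{u.a.r.} [n]$
        Let $\gamma := \ldots$ (predefined step-size)
        Solve $y_i^* := \operatorname{argmax}_{y \in \mathcal{Y}_i} H_i(y; w^{(k)})$
        Let $w_s := \frac{1}{\lambda n}\,\psi_i(y_i^*)$ and $\ell_s := \frac{1}{n}\,L_i(y_i^*)$
        Optionally: Choose $\gamma$ by line-search
        Update $w_i^{(k+1)} := (1 - \gamma)\,w_i^{(k)} + \gamma\,w_s$
            and $\ell_i^{(k+1)} := (1 - \gamma)\,\ell_i^{(k)} + \gamma\,\ell_s$
        Update $w^{(k+1)} := w^{(k)} + w_i^{(k+1)} - w_i^{(k)}$
            and $\ell^{(k+1)} := \ell^{(k)} + \ell_i^{(k+1)} - \ell_i^{(k)}$
    end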



A Primal-Dual Frank-Wolfe Algorithm for the Structural SVM Dual. By applying the Frank-Wolfe
Algorithm 1 to the dual of the structural SVM (4), but only maintaining the primal
iterates $w^{(k)} = A\alpha^{(k)}$, we obtain Algorithm 3. Note that the Frank-Wolfe search corner
$s = (e^{y_1^*}, \ldots, e^{y_n^*})$, which is obtained by solving the loss-augmented subproblems, yields the update
vector $w_s = As$. Furthermore, we have used the natural starting point $\alpha^{(0)} := (e^{y_1}, \ldots, e^{y_n})$,
which yields $w^{(0)} = 0$ as $\psi_i(y_i) = 0$ for all $i$. The $\ell$ quantities are maintained for the computation
of the duality gap as well as the line-search step-size (see below).

The Duality Gap. The duality gap (5) for our structural SVM dual objective function (4) is
given by $g(\alpha) := \max_{s' \in \mathcal{M}} \langle \alpha - s', \nabla f(\alpha) \rangle = (\alpha - s)^T (\lambda A^T A \alpha - b) = \lambda (w - As)^T w - b^T \alpha + b^T s$,
where $s$ is an exact minimizer of the linearized problem given at the point $\alpha$. Note that this
(Fenchel) duality gap turns out to be the same as the Lagrangian duality gap (see Appendix B.2),
and so gives us a direct handle on the suboptimality error of $w^{(k)}$ for the primal problem (3).
Using $w_s := As$ and $\ell_s := b^T s$, we observe that the gap is well-defined and efficient to compute
when we only have the primal variables $w = A\alpha$ and $\ell = b^T \alpha$ available, and becomes

$$g(w, \ell, w_s, \ell_s) := \lambda (w - w_s)^T w - \ell + \ell_s = g(\alpha). \tag{7}$$

Since the quantities $w, \ell, w_s, \ell_s$ are maintained during the run of Algorithm 3, we can keep track
of the duality gap, and use $g(\alpha^{(k)}) = g(w^{(k)}, \ell^{(k)}, w_s, \ell_s) \le \varepsilon$ as the proper stopping criterion.



Implementing the Line-Search. Because the objective of the structural SVM dual (4) is a
quadratic function in $\alpha$, the optimal step-size for any given candidate search point $s \in \mathcal{M}$ can be
obtained analytically with a simple formula. Namely, if we let $\gamma_{LS} := \operatorname{argmin}_{\gamma \in [0,1]} f(\alpha + \gamma(s - \alpha))$,
we have that $\gamma_{LS} = \max\{0, \min\{1, \gamma_{opt}\}\}$, where $\gamma_{opt}$ is obtained by setting the derivative of the
corresponding univariate quadratic function in $\gamma$ to zero, which here gives

$$\gamma_{opt} := \frac{\langle \alpha - s, \nabla f(\alpha) \rangle}{\lambda \|A(\alpha - s)\|^2} = \frac{g(w, \ell, w_s, \ell_s)}{\lambda \|w - w_s\|^2}.$$

The first equality is valid for any search point $s \in \mathcal{M}$ (and so can also be used for the
other variant described in Algorithm 4 or for an approximate search point), whereas the second
equality (with the duality gap) is only valid for $s$ being the exact minimizer of the linearized
problem at $\alpha$. In both cases, all the necessary quantities to compute $\gamma_{LS}$ are maintained during
the run of Algorithm 3, as was the case for the duality gap.
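To make the primal bookkeeping concrete, here is a minimal Python sketch of the line-search variant of Algorithm 4 together with the duality gap (7). It assumes a toy multiclass setting (block feature map, 0/1 loss, oracle by enumeration); the names `psi`, `oracle`, `duality_gap` and `bcfw` are hypothetical and not taken from the paper. Only the line-search step is implemented, using the first equality for $\gamma_{opt}$ restricted to block $i$, where $A(\alpha - s)$ becomes $w_i - w_s$.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def psi(x, y_true, y, num_labels):
    """psi_i(y) = phi(x, y_true) - phi(x, y) for an assumed multiclass block feature map."""
    d = len(x)
    out = [0.0] * (d * num_labels)
    for t in range(d):
        out[y_true * d + t] += x[t]
        out[y * d + t] -= x[t]
    return out


def oracle(w, x, y_true, num_labels):
    """Loss-augmented decoding by enumeration (0/1 loss): returns (y_star, H_i(y_star; w))."""
    scores = [(0.0 if y == y_true else 1.0) - dot(w, psi(x, y_true, y, num_labels))
              for y in range(num_labels)]
    y_star = max(range(num_labels), key=lambda y: scores[y])
    return y_star, scores[y_star]


def duality_gap(w, ell, xs, ys, num_labels, lam):
    """Gap (7): lam * (w - w_s)^T w - ell + ell_s, using one oracle call per example."""
    n = len(xs)
    w_s, ell_s = [0.0] * len(w), 0.0
    for x, y in zip(xs, ys):
        y_star, _ = oracle(w, x, y, num_labels)
        p = psi(x, y, y_star, num_labels)
        w_s = [a + b / (lam * n) for a, b in zip(w_s, p)]
        ell_s += (0.0 if y_star == y else 1.0) / n
    return lam * dot([a - b for a, b in zip(w, w_s)], w) - ell + ell_s


def bcfw(xs, ys, num_labels, lam, num_iters, seed=0):
    """Block-coordinate Frank-Wolfe with line-search; iterates are kept in the primal only."""
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0]) * num_labels
    w, ell = [0.0] * dim, 0.0
    w_blocks = [[0.0] * dim for _ in range(n)]   # per-example contributions w_i
    ell_blocks = [0.0] * n                       # per-example contributions ell_i
    for _ in range(num_iters):
        i = rng.randrange(n)                     # pick a block uniformly at random
        y_star, _ = oracle(w, xs[i], ys[i], num_labels)
        w_s = [p / (lam * n) for p in psi(xs[i], ys[i], y_star, num_labels)]
        ell_s = (0.0 if y_star == ys[i] else 1.0) / n
        diff = [a - b for a, b in zip(w_blocks[i], w_s)]
        numer = lam * dot(diff, w) - ell_blocks[i] + ell_s
        denom = lam * dot(diff, diff)
        if denom > 0.0:
            gamma = max(0.0, min(1.0, numer / denom))
        else:
            gamma = 1.0 if numer > 0.0 else 0.0  # objective is linear along this direction
        new_block = [(1.0 - gamma) * a + gamma * b for a, b in zip(w_blocks[i], w_s)]
        new_ell = (1.0 - gamma) * ell_blocks[i] + gamma * ell_s
        w = [a + nb - ob for a, nb, ob in zip(w, new_block, w_blocks[i])]
        ell += new_ell - ell_blocks[i]
        w_blocks[i], ell_blocks[i] = new_block, new_ell
    return w, ell
```

In this sketch the full duality gap is computed by a separate batch pass over the data, since a single stochastic step only sees one block; how often to pay for that pass is a design choice left to the user.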
Convergence Proof and Running Time. In the following, we write $R$ for the maximal length
of a difference feature vector, i.e. $R := \max_{i \in [n],\, y \in \mathcal{Y}_i} \|\psi_i(y)\|_2$, and
$L_{\max} := \max_{i,\, y} L_i(y)$ for the maximum error. By bounding the curvature constant $C_f$ for the
objective function $f$ being the dual SVM objective (4), we can now directly apply the known
convergence results for the standard Frank-Wolfe algorithm. This leads to the main result of
this section, that the number of iterations of Algorithm 3 is independent of the data size $n$.

Theorem 2 (Batch FW SVM). Algorithm 3 obtains an $\varepsilon$-approximate solution to the structural
SVM dual problem (4) and duality gap $g(\alpha^{(k)}) \le \varepsilon$ after at most $O\big(\frac{R^2}{\lambda \varepsilon}\big)$ iterations, where each
iteration costs $n$ oracle calls.

Since we have proved that the duality gap is smaller than $\varepsilon$, this implies that the original SVM
primal objective (3) is actually solved to accuracy $\varepsilon$ as well.

Relationship with Batch Subgradient Descent in the Primal. The batch Frank-Wolfe
Algorithm 3 is equivalent to subgradient descent in the primal, though with a clever choice of step-size
in case the line-search in the dual is used. To see this, notice that a subgradient of (3) is given
by $d_{sub} = \lambda w - \frac{1}{n} \sum_i \psi_i(y_i^*) = \lambda (w - w_s)$, where $y_i^*$ and $w_s$ are as defined in Algorithm 3.
Hence, for a step-size of $\beta$, the subgradient descent update becomes
$w^{(k+1)} := w^{(k)} - \beta\, d_{sub} = w^{(k)} - \beta \lambda (w^{(k)} - w_s) = (1 - \beta \lambda)\, w^{(k)} + \beta \lambda\, w_s$. Comparing this with Algorithm 3, we see that
each Frank-Wolfe step on the dual problem (4) with step-size $\gamma$ is equivalent to a batch
subgradient descent step in the primal with a step-size of $\beta = \gamma / \lambda$, and thus our convergence results
also apply to it. This seems to generalize the equivalence between Frank-Wolfe optimization and
subgradient descent for a quadratic objective with identity Hessian which was already observed
in Bach et al. (2012, Section 4.1) and in Bach (2011, Section 6.3).
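The step-size correspondence $\beta = \gamma / \lambda$ can be checked numerically with a few lines of Python. The helper names `fw_step` and `subgradient_step` and the vectors in the usage lines are arbitrary illustrations, not values from the paper.

```python
def fw_step(w, w_s, gamma):
    """Primal form of a Frank-Wolfe step: (1 - gamma) * w + gamma * w_s."""
    return [(1.0 - gamma) * a + gamma * b for a, b in zip(w, w_s)]


def subgradient_step(w, w_s, lam, beta):
    """Batch subgradient step w - beta * d_sub, with d_sub = lam * (w - w_s)."""
    return [a - beta * lam * (a - b) for a, b in zip(w, w_s)]


# With beta = gamma / lam the two updates coincide up to rounding.
w, w_s, lam, gamma = [1.0, -2.0, 0.5], [0.0, 1.0, 2.0], 0.1, 0.25
a = fw_step(w, w_s, gamma)
b = subgradient_step(w, w_s, lam, gamma / lam)
```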

Relationship with Cutting Plane Algorithms. The cutting plane algorithm of Joachims et al.
(2009) (either in its 1-slack or n-slack version) iteratively finds new coordinates to add
to the dual problem (which equivalently correspond to constraints in the primal problem) by
solving the same loss-augmented decoding problem for each datapoint that we use in the batch
Frank-Wolfe algorithm. But instead of doing a line search towards the corner $s$, as is done in
Frank-Wolfe, it re-optimizes the QP over all the previously added constraints. The
minimum-norm-point extension of Frank-Wolfe, which in each iteration re-optimizes over all previously
visited coordinates (Clarkson, 2010), can thus be seen as equivalent to it. The
minimum-norm-point convergence results simply reuse those of Frank-Wolfe with line search, and thus all our
results apply as well to the cutting plane algorithm of Joachims et al. (2009).

5. Faster Frank-Wolfe Coordinate Descent for the Structural SVM Dual

Algorithm 4 represents our new coordinate descent Frank-Wolfe Algorithm 2 applied to the SVM
dual problem (4). Using that the updates $w_s$ and $\ell_s$ correspond to $w_s = A s_{[i]}$ and $\ell_s = b^T s_{[i]}$,
where $s_{[i]}$ is the padding with zeros of $s_{(i)} := e^{y_i^*} \in \mathcal{M}^{(i)}$ so that $s_{[i]} \in \mathbb{R}^m$, we obtain that
Algorithm 4 is actually equivalent to Algorithm 2.

Convergence Proof and Running Time. In order to apply the convergence result of our new
product Frank-Wolfe algorithm to the SVM case, we need to bound the total curvature
constant $C_f^{\otimes}$ for the dual SVM problem. This then directly leads to the following theorem, showing
that the total running time of Algorithm 4 is independent of the data size $n$, when measured
in the number of oracle calls. In other words, the theory says that our product Algorithm 4
is $n$ times faster than the batch Frank-Wolfe version in Algorithm 3, where each
iteration requires $n$ oracle calls. This speed-up is enabled by the fact that for the SVM dual
optimization problem, the product curvature $C_f^{\otimes}$ turns out to be $n$ times smaller than the classical
Frank-Wolfe curvature constant $C_f$. We also observe this running time difference in practice
in the experiments in Section 7.

Theorem 3 (Stochastic FW SVM). If $L_{\max} \le \ldots$, then Algorithm 4 obtains an $\varepsilon$-approximate
solution to the structural SVM dual problem (4) and expected duality gap $\mathbb{E}[g(\alpha^{(k)})] \le \varepsilon$ after at
most $O\big(\frac{R^2}{\lambda \varepsilon}\big)$ iterations, where each iteration costs a single oracle call.

If $L_{\max} > \ldots$, then the line-search variant of Algorithm 4 will require an additional (constant
in $\varepsilon$) number of $2n \log(\ldots)$ steps to get the same error and duality gap guarantees, whereas
the predefined step-size variant will require an additional $O(\ldots)$ steps.

6. Extensions 

Approximate Linear Subproblems and Approximate Decoding. Interestingly, it can be shown
that the convergence results we presented above also hold if approximate minimizers of the
linear subproblems are used instead of exact minimizers. More formally, we require that the
step direction $s_{(i)}$ in Algorithm 2 (or $s$ in Algorithm 1) is chosen such that
$\langle s_{(i)}, \nabla_{(i)} f(\alpha^{(k)}) \rangle \le \min_{s'_{(i)} \in \mathcal{M}^{(i)}} \langle s'_{(i)}, \nabla_{(i)} f(\alpha^{(k)}) \rangle + \varepsilon_k$,
where the additive approximation quality $\varepsilon_k$ is proportional to $\gamma_k C_f$, for $\gamma_k$ being the
predefined step size. With a step-oracle of this accuracy, the above
convergence bounds from Theorem 1 still apply to the approximate version of Algorithm 2
(and so to Algorithm 1, using $n = 1$). The only change is that the upper bounds on the errors
are multiplied by a factor of two. A proof of this generalization is also provided in Appendix E.

This makes the algorithm more applicable to large-scale applications where it is too costly to 
do the linear optimization exactly. In the case of structural SVMs, this means that we can run 
the above mentioned algorithms with approximate loss-augmented decoders which is crucial for 
many applications. 

Kernelized Algorithm. Both Algorithms 3 and 4 can be used with kernels by explicitly
maintaining the sparse dual variables $\alpha^{(k)}$ instead of the primal variables $w^{(k)}$. In this case, the
classifier is only given implicitly as a sparse combination of the corresponding kernel functions,
i.e. $w = A\alpha$. Using our Algorithm 4, we obtain the currently best known bound on the number
of support vectors, i.e. a guaranteed $\varepsilon$-approximation consisting of only $O\big(\frac{R^2}{\lambda \varepsilon}\big)$ support
vectors. For comparison, the standard cutting plane method (Joachims et al., 2009) adds $n$ support
vectors $\psi_i(y)$ at each of its iterations. A more detailed version of the kernelized variant of
Algorithm 4 is given in Appendix D.4.

7. Experiments 

Here we compare existing algorithms for solving the structural SVM problem to our novel
Frank-Wolfe approaches. The different methods are applied to the OCR dataset from Taskar et al.
(2003) and the CoNLL dataset (Sang and Buchholz, 2000). Both datasets correspond to sequence
labeling tasks, for which the loss-augmented decoding problem is solved exactly by the Viterbi
algorithm. The Frank-Wolfe variants studied are: batch Frank-Wolfe with and without line-search
(fw-ls and fw) as in Algorithm 3, and stochastic Frank-Wolfe with and without line-search (sfw-ls
and sfw), see Algorithm 4. We include the following competing methods in the comparison:
the cutting plane algorithm implemented in SVMstruct (Joachims et al., 2009) with its default
options, standard and averaged Pegasos (Shalev-Shwartz et al., 2010), and the optimal stochastic
subgradient method from Rakhlin et al. (2012), which is the same as Pegasos except that the averaging is
only started after half of the iterations (which yields a convergence rate of $O(1/k)$ vs. $O(\log k / k)$
for Pegasos). The progress of the different algorithms is visualized in Figure 1. Figure 1a
compares the batch algorithms to the stochastic Frank-Wolfe algorithm with line-search. We
observe that in about 20 passes through the dataset, stochastic FW achieves a solution close to
the optimum, whereas the batch solvers do not converge even within 150 iterations. The improved
convergence of the stochastic algorithms when compared to the batch solvers was systematic in
all experiments for large $\lambda$. However, for small values of $\lambda$, the stochastic approaches without
line-search, such as Pegasos, often perform worse than the batch algorithms. The suggested stochastic
version of Frank-Wolfe with line-search does not share this weakness and performs well in both
settings. Figures 1b and 1c show the progression of the stochastic solvers for the OCR and the
CoNLL datasets. In both of these experiments, Algorithm 4 with line-search outperforms the other
stochastic algorithms. The improvement is especially large in the first iterations. We conjecture
this is due to the line-search advantage, as there is also an improvement in the first few passes
for sfw-ls vs. sfw. Finally, Figure 1d shows the test error during the optimization; there, the
stochastic Frank-Wolfe algorithm still shows an advantage, though smaller than for the primal
objective.




[Figure 1: four panels, x-axis "effective passes".
(a) Batch vs. stochastic for $\lambda = 0.01$ on OCR.
(b) Stochastic solvers for $\lambda = 0.01$ on OCR.
(c) Primal objective for $\lambda = 0.001$ on CoNLL.
(d) Test error for $\lambda = 0.001$ on CoNLL.]

Figure 1: In our experiments the stochastic Frank-Wolfe algorithm with line-search (sfw-ls)
dominates all competing approaches. See text for details.



8. Related Work 

There has been a fair amount of work on coordinate descent methods for the dual of the binary SVM,
such as the original SMO algorithm. The SMO algorithm was generalized to the factored
representation of the Max-Margin Markov network version of structural SVMs (and thus using
something equivalent to an expectation oracle) in Taskar (2004, Chapter 6), but its convergence
rate scales badly with the size of the output space: it was estimated as
$O(n|\mathcal{Y}|/(\lambda\varepsilon))$ in Zhang et al. (2011). This is the only coordinate
descent method for the dual of the structural SVM with rate guarantees that we are aware of (in
addition to the new stochastic Frank-Wolfe algorithm). Rousu et al. (2006) proposed another
coordinate descent method on the factored dual: they pick one sample at a time and optimize on the
relevant subspace using multiple Frank-Wolfe updates, though with no global rate guarantees.
Interestingly, our stochastic Frank-Wolfe algorithm applied to the binary SVM reduces to the dual
descent method proposed in Hsieh et al. (2008), which samples one random datapoint at a time to
update its associated unique dual variable, so that exact coordinate optimization can be
accomplished (in our formulation, we get a 2-simplex for the binary SVM and so the line-search
also gives the exact subspace optimizer). In this case, they prove a locally linear rate of
convergence, thanks to the exact subspace optimization. We note that our duality gap guarantee
for stochastic Frank-Wolfe thus clarifies that the dual descent method of Hsieh et al. (2008)
also yields a primal convergence guarantee of $O(1/k)$, thus improving over Pegasos; this seems
to have been observed empirically in Shalev-Shwartz et al. (2010). The work of Hsieh et al.
(2008) was generalized to the structural SVM in Balamurugan et al. (2011) in a sequential dual
minimization approach where a QP on each example is approximately solved sequentially using SMO,
but with no rate guarantees. Finally, we note that few guarantees are given for the optimization
of structural SVMs with approximate oracles. The cutting plane algorithm of Tsochantaridis et al.
(2005) was analyzed in Finley and Joachims (2008) when the maximization oracle gives a
multiplicative error, though the dependence of the running time needed by this algorithm to
achieve an $\varepsilon$-approximate solution on this multiplicative error was left unclear. In
contrast, we provide guarantees for batch subgradient, cutting plane as well as the stochastic
Frank-Wolfe algorithm to achieve an $\varepsilon$-approximate solution as long as the additive
error of the oracle is within $\varepsilon$. We get stronger guarantees, but also under a
stronger assumption, as the approximation errors of maximization oracles are usually specified
with a multiplicative factor.

9. Conclusion 

We highlighted the equivalence of batch Frank-Wolfe on the dual of the structural SVM with batch
subgradient in the primal, thereby obtaining a line-search version with better robustness and
improving on its rate analysis. We conjecture that this kind of equivalence could be generalized
to other quadratic objectives and thus could provide additional insights on other traditional
algorithms. We proposed a new stochastic coordinate descent Frank-Wolfe algorithm on arbitrary
compact product domains with provable convergence guarantees. When applying it to structural
SVMs, we obtain a simple online algorithm which converges empirically faster than other
stochastic algorithms on the first few passes of the data, thanks to an increased robustness from
the line-search. As step-size selection is a notoriously hard problem for stochastic subgradient
methods, the line-search gives it a significant advantage. Since the stochastic coordinate
descent Frank-Wolfe optimization algorithm is applicable to general compact product domains and
convex functions, we believe that it can provide a new optimization tool of choice in the machine
learning toolbox.

A. Equivalence of the "Linearization"-Duality Gap to a Special Case of Fenchel Duality

For our constrained optimization framework, the notion of the simple duality gap was crucial.
Consider a general constrained optimization problem

$$ \min_{x \in \mathcal{M}} f(x) , \qquad (8) $$

where the domain (or feasible set) $\mathcal{M} \subseteq \mathcal{X}$ is an arbitrary compact
subset of a Euclidean space $\mathcal{X}$. We assume that the objective function $f$ is convex,
but not necessarily differentiable.

In this case, the general "linearization" duality gap (5) as proposed by Jaggi (2011, Section
2.2) is given by

$$ g(x; d_x) := I^*_{\mathcal{M}}(-d_x) + \langle x, d_x \rangle . \qquad (9) $$

Here $d_x$ is an arbitrary subgradient of $f$ at the candidate position $x$, and
$I^*_{\mathcal{M}}(y) := \sup_{s \in \mathcal{M}} \langle s, y \rangle$ is the support function of
the set $\mathcal{M}$.

Convexity of $f$ implies that the linearization $f(x) + \langle s - x, d_x \rangle$ always lies
below the graph of the function $f$, as illustrated by the figure in Section 3. This immediately
gives the crucial property of the duality gap (9), as being a certificate for the current
approximation quality, i.e. upper-bounding the (unknown) error
$g(x) \ge f(x) - f(x^*)$, where $x^*$ is some optimal solution.

Note that for differentiable functions $f$, the gradient is the unique subgradient at $x$,
therefore the duality gap equals $g(x) := g(x; \nabla f(x))$ as we defined in (5).
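The certificate property of (9) can be checked numerically on a small example. The sketch below is
an illustration only: it assumes the domain is the probability simplex (so the support function is
a maximum over coordinates) and uses a hypothetical toy objective $f(x) = \frac12\|x - c\|^2$; the
function names are not taken from the paper.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def support_simplex(y):
    """Support function of the probability simplex: sup_s <s, y> = max_j y_j."""
    return max(y)

def objective(x, c):
    """Toy convex objective f(x) = 0.5 * ||x - c||^2 (assumed for illustration)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, c))

def gradient(x, c):
    """Gradient of the toy objective, the unique subgradient d_x."""
    return [a - b for a, b in zip(x, c)]

def linearization_gap(x, d):
    """g(x; d) = I*_M(-d) + <x, d> for M the probability simplex."""
    return support_simplex([-v for v in d]) + inner(x, d)

c = [0.2, 0.9, -0.4]
x_uniform = [1 / 3, 1 / 3, 1 / 3]
x_candidate = [0.15, 0.85, 0.0]

gap_uniform = linearization_gap(x_uniform, gradient(x_uniform, c))
gap_candidate = linearization_gap(x_candidate, gradient(x_candidate, c))
primal_error = objective(x_uniform, c) - objective(x_candidate, c)
```

Here the gap at `x_candidate` vanishes (up to rounding), which certifies that this point is
optimal, and the gap at the uniform point upper-bounds its primal error without knowing the
optimum in advance.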

Fenchel Duality. Here we additionally explain how the duality gap (9) can also be interpreted as
a special case of standard Fenchel convex duality.

We consider the equivalent formulation of our constrained problem (8), given by

$$ \min_{x \in \mathcal{X}} f(x) + I_{\mathcal{M}}(x) . $$

Here the set indicator function $I_{\mathcal{M}}$ of a subset $\mathcal{M} \subseteq \mathcal{X}$
is defined as $I_{\mathcal{M}}(x) := 0$ for $x \in \mathcal{M}$ and
$I_{\mathcal{M}}(x) := +\infty$ for $x \notin \mathcal{M}$.

The Fenchel conjugate function $f^*$ of a function $f$ is given by
$f^*(y) := \sup_{x \in \mathcal{X}} \langle x, y \rangle - f(x)$.

For example, observe that the Fenchel conjugate of a set indicator function
$I_{\mathcal{M}}(\cdot)$ is given by its support function $I^*_{\mathcal{M}}(\cdot)$.

From the above definition of the conjugate, the Fenchel-Young inequality
$f(x) + f^*(y) \ge \langle x, y \rangle \ \ \forall x, y \in \mathcal{X}$
follows directly.
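As a quick sanity check of these definitions, here is a small worked example. It is an added
illustration, not part of the original derivation: it assumes $f(x) = \frac12\|x\|_2^2$ on
$\mathcal{X} = \mathbb{R}^m$ and takes $\mathcal{M}$ to be the probability simplex $\Delta_m$.

```latex
\begin{align*}
f^*(y) &= \sup_{x} \; \langle x, y\rangle - \tfrac12\|x\|_2^2 \;=\; \tfrac12\|y\|_2^2
  && \text{(the supremum is attained at } x = y\text{)} \\
f(x) + f^*(y) - \langle x, y\rangle &= \tfrac12\|x - y\|_2^2 \;\ge\; 0
  && \text{(Fenchel-Young, with equality iff } y = x = \nabla f(x)\text{)} \\
I^*_{\Delta_m}(y) &= \sup_{s \in \Delta_m} \langle s, y\rangle \;=\; \max_{j \in [m]} y_j
  && \text{(a linear function is maximized at a vertex } e_j\text{)}
\end{align*}
```

The last line is what makes the support function, and hence the gap (9), cheap to evaluate on
simplex domains.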

Now we consider the Fenchel dual problem of minimizing $p(x) := f(x) + I_{\mathcal{M}}(x)$, which
is defined as maximizing $d(y) := -f^*(y) - I^*_{\mathcal{M}}(-y)$. By the Fenchel-Young
inequality, and assuming that $x \in \mathcal{M}$, we have that $\forall y \in \mathcal{X}$,

$$ p(x) - d(y) \;=\; f(x) - \big( -f^*(y) - I^*_{\mathcal{M}}(-y) \big)
   \;\ge\; \langle x, y \rangle + I^*_{\mathcal{M}}(-y) \;=\; g(x; y) . $$

Furthermore, this inequality becomes an equality if and only if $y$ is chosen as a subgradient of
$f$ at $x$, that is if $y := d_x$. The last fact follows from the known equivalent
characterization of the subdifferential in terms of the Fenchel conjugate:
$\partial f(x) := \{ y \in \mathcal{X} \mid f(x) + f^*(y) = \langle x, y \rangle \}$. For a more
detailed explanation of Fenchel duality, we refer the reader to the standard literature, e.g.
Borwein and Lewis (2006, Theorem 3.3.5).

To summarize, we have obtained that the simpler "linearization" duality gap $g(x; d_x)$ as given
in (9) is indeed the difference of the current objective to the Fenchel dual problem, when being
restricted to the particular choice of the dual variable $y$ being a subgradient at the current
position $x$.

B. Derivations of the SVM Dual Formulations 
B.1. Derivation of the n-Slack Dual

Proof of the dual of the n-Slack-Formulation. See also Collins et al. (2008). For a
self-contained explanation of Lagrange duality we refer the reader to Boyd and Vandenberghe
(2004, Section 5). The Lagrangian of (1) is

$$ L(w, \xi, \alpha) = \frac{\lambda}{2} \langle w, w \rangle + \frac{1}{n} \sum_{i=1}^{n} \xi_i
   + \sum_{i \in [n],\, y \in \mathcal{Y}_i} \frac{1}{n} \alpha_i(y)
     \big( -\xi_i + \langle w, -\psi_i(y) \rangle + L_i(y) \big) , $$

where $\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^{|\mathcal{Y}_1|} \times \dots \times
\mathbb{R}^{|\mathcal{Y}_n|} = \mathbb{R}^m$ are the corresponding (non-negative) Lagrange
multipliers. Here we have re-scaled the multipliers (dual variables) by a constant of
$\frac{1}{n}$, corresponding to multiplying the corresponding original primal constraint by
$\frac{1}{n}$ on both sides, which does not change the optimization problem.

Since the objective as well as the constraints are continuously differentiable with respect to
$(w, \xi)$, the Lagrangian $L$ will attain its finite minimum over $(w, \xi)$, for fixed
$\alpha$, when $\nabla_{(w, \xi)} L(w, \xi, \alpha) = 0$. Making this saddle-point condition
explicit results in a simplified Lagrange dual problem, which is also known as the Wolfe dual.
In our case, this condition from differentiating w.r.t. $w$ is

$$ \lambda w = \sum_{i \in [n],\, y \in \mathcal{Y}_i} \frac{1}{n} \alpha_i(y) \psi_i(y) .
   \qquad (10) $$

Differentiating with respect to $\xi_i$ and setting the derivatives to zero gives

$$ \sum_{y \in \mathcal{Y}_i} \alpha_i(y) = 1 \quad \forall i \in [n] . $$

Plugging this condition and the expression (10) for $w$ back into the Lagrangian, we obtain the
Lagrange dual problem

$$ \max_{\alpha} \; - \frac{\lambda}{2} \Big\| \sum_{i \in [n],\, y \in \mathcal{Y}_i}
     \alpha_i(y) \frac{\psi_i(y)}{\lambda n} \Big\|^2
   + \sum_{i \in [n],\, y \in \mathcal{Y}_i} \alpha_i(y) \frac{L_i(y)}{n} $$
$$ \text{s.t.} \quad \sum_{y \in \mathcal{Y}_i} \alpha_i(y) = 1 \ \ \forall i \in [n],
   \quad \text{and} \quad \alpha_i(y) \ge 0 \ \ \forall i \in [n],\ \forall y \in \mathcal{Y}_i , $$

which is exactly the negative of the quadratic program claimed in (4). □



B.2. Relation between the Fenchel Gap and the Classical Lagrange Duality Gap 

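Consider the difference of our objective at $w := A\alpha$ in the primal problem (1), and the
dual objective at $\alpha$ in problem (4) (in the maximization version). This difference is

$$ g_{\mathrm{Lagrange}}(w, \alpha)
   = \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^{n} \xi_i
     - \Big( b^T \alpha - \frac{\lambda}{2} w^T w \Big)
   = \lambda w^T w - b^T \alpha + \frac{1}{n} \sum_{i=1}^{n} \xi_i . $$

We now suppose that we use the minimizing slack variables for the primal problem (1) (i.e. we
are computing the primal objective (3)); thus we set
$\xi_i = \max_{y \in \mathcal{Y}_i} H_i(y; w)$ (cf. (2)). Furthermore, we use that by the
definition of $A$ and $b$, we have $\frac{1}{n} H_i(y; w) = (b - \lambda A^T w)_{(i,y)}$.
Summing up over all points, we obtain

$$ \frac{1}{n} \sum_{i=1}^{n} \xi_i
   = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i} H_i(y; w)
   = \sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i} (b - \lambda A^T w)_{(i,y)} . $$

Now the equivalence of the two expressions follows along the same lines as in Lemma 6 on the
definition of the update direction $s$. Formally, since $s$ in Algorithm 3 was precisely defined
to give
$\frac{1}{n} \sum_{i=1}^{n} \xi_i = \sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i}
(b - \lambda A^T w)_{(i,y)} = s^T (b - \lambda A^T w)$,
we conclude that the standard Lagrangian gap is indeed identical to the duality gap as defined
in (5) and (7), i.e.

$$ g_{\mathrm{Lagrange}}(w, \alpha) = \lambda (w - As)^T w - b^T \alpha + b^T s . $$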



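Footnote (on the condition $\sum_{y \in \mathcal{Y}_i} \alpha_i(y) = 1$ in Section B.1): Note
that because the Lagrangian is linear in $\xi_i$, if this condition is not satisfied, the
minimization of the Lagrangian in $\xi_i$ yields $-\infty$ and so these points can be excluded.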

C. The Curvature Constants $C_f$ and $C_f^{\mathrm{prod}}$

The Curvature Constant $C_f$. The curvature constant $C_f$ (alternatively also called the strong
smoothness constant of $f$) is given by the relative deviation of our function $f$ from its
linear approximations, over the domain $\mathcal{M}$. Formally,

$$ C_f := \sup_{\substack{x, s \in \mathcal{M},\ \gamma \in [0,1], \\ y = x + \gamma (s - x)}}
   \frac{1}{\gamma^2} \big( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \big) .
   \qquad (11) $$

It is known that $C_f$ is upper bounded by the Lipschitz constant of the gradient $\nabla f$
times the squared diameter of $\mathcal{M}$ (Clarkson, 2010; Jaggi, 2011).

The Product Curvature Constant $C_f^{\mathrm{prod}}$. The curvature concept can be generalized
to our setting of product domains
$\mathcal{M} := \mathcal{M}^{(1)} \times \dots \times \mathcal{M}^{(n)}$ as follows: over each
individual factor, the curvature is given by

$$ C_f^{(i)} := \sup_{\substack{x \in \mathcal{M},\ s_{(i)} \in \mathcal{M}^{(i)},\
   \gamma \in [0,1], \\ y = x + \gamma (s_{[i]} - x_{[i]})}}
   \frac{1}{\gamma^2} \big( f(y) - f(x)
   - \langle y_{(i)} - x_{(i)}, \nabla_{(i)} f(x) \rangle \big) , \qquad (12) $$

where $x_{[i]}$ is $x_{(i)}$ padded with zeros so that $x_{[i]} \in \mathbb{R}^m$. By
considering the Taylor expansion of $f$, it is not hard to see that also the "partial" curvature
is upper bounded by the Lipschitz constant of the partial gradient $\nabla_{(i)} f$ times the
squared diameter of just one domain factor $\mathcal{M}^{(i)}$. See also the proof of Lemma 5
below.

We define the global product curvature constant as the sum of these curvatures for each factor,
i.e.

$$ C_f^{\mathrm{prod}} := \sum_{i=1}^{n} C_f^{(i)} . \qquad (13) $$

Observe that for the classical Frank-Wolfe case when $n = 1$, we recover the original duality gap
as well as curvature constant.
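To make the two constants concrete, the following sketch evaluates them by brute force on a tiny
instance. It rests on assumptions that should be read as such: the objective is taken to be a
quadratic of the dual form $f(\alpha) = \frac{\lambda}{2}\|A\alpha\|^2 - b^T\alpha$, the
normalization $1/\gamma^2$ is the one read from (11) and (12), and the domain is a product of
simplices. Under these assumptions the bracket in the definitions equals
$\frac{\lambda}{2}\gamma^2\|A(s - x)\|^2$, a convex function whose supremum is attained at
vertices, so enumerating vertices suffices. All names are hypothetical.

```python
from itertools import product

def sqnorm(v):
    """Squared Euclidean norm of a list of numbers."""
    return sum(a * a for a in v)

def curvature_constants(blocks, lam):
    """Brute-force curvature constants for f(a) = lam/2 * ||A a||^2 - b^T a.

    blocks[i] is the list of columns of A belonging to block i; each block of
    variables lives on a probability simplex. Returns (C_f, [C_f^(i)]).
    """
    # Per-block curvature: only block i moves, between two of its vertices.
    c_blocks = [
        0.5 * lam * max(sqnorm([p - q for p, q in zip(u, v)])
                        for u in cols for v in cols)
        for cols in blocks
    ]
    # Global curvature: every block may move at once (vertex pairs of the product).
    dim = len(blocks[0][0])
    c_global = 0.0
    for xs in product(*blocks):
        for ss in product(*blocks):
            diff = [sum(s[r] - x[r] for s, x in zip(ss, xs)) for r in range(dim)]
            c_global = max(c_global, 0.5 * lam * sqnorm(diff))
    return c_global, c_blocks

# Two blocks with two labels each, features in R^2 (made-up numbers).
blocks = [[[1, 0], [0, 1]], [[2, 0], [0, 0]]]
c_global, c_blocks = curvature_constants(blocks, lam=0.5)
c_prod = sum(c_blocks)
```

On this instance the global constant exceeds the product constant, while staying below $n$ times
it, which is what the Cauchy-Schwarz inequality allows for quadratics of this form.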



Computing the Curvature Constant $C_f$ in the SVM Case

Lemma 4. For the dual structural SVM objective function (4) over the domain
$\mathcal{M} := \Delta_{|\mathcal{Y}_1|} \times \dots \times \Delta_{|\mathcal{Y}_n|}$, the
curvature constant $C_f$, as defined in (11), is upper bounded by

$$ C_f \le \frac{2R^2}{\lambda} , $$

where $R$ is the maximal length of a difference feature vector, i.e.
$R := \max_{i \in [n],\, y \in \mathcal{Y}_i} \|\psi_i(y)\|_2$.

Proof of Lemma 4. If the function is twice differentiable, we can plug the second-degree Taylor
expansion of $f$ into the above definition (11) of the curvature, see e.g. Jaggi (2011,
Inequality (2.12)) or Clarkson (2010, Section 4.1). In our case, the gradient at $\alpha$ is
given by $\lambda A^T A \alpha - b$, so that the Hessian is $\lambda A^T A$, being a constant
matrix independent of $\alpha$. This gives the following upper bound on $C_f$, which we can
separate into two identical matrix-vector products with our matrix $A$:

$$ C_f \le \sup_{\substack{x, y \in \mathcal{M}, \\ z \in [x, y] \subseteq \mathcal{M}}}
     \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)
   = \frac{\lambda}{2} \cdot \sup_{x, y \in \mathcal{M}} (A(y - x))^T A(y - x) $$
$$ = \frac{\lambda}{2} \cdot \sup_{v, w \in A\mathcal{M}} \|v - w\|_2^2
   \le 2\lambda \cdot \sup_{v \in A\mathcal{M}} \|v\|_2^2 . $$

By definition of our compact domain $\mathcal{M}$, we have that each vector
$v \in A\mathcal{M}$ is precisely the sum of $n$ vectors, each of these being a convex
combination of the (scaled) feature vectors for the possible labelings for data point $i$.

Therefore, the norm $\|v\|_2$ is upper bounded by $n$ times the longest column of the matrix
$A$, or more formally $\|v\|_2 \le n \cdot \frac{R}{\lambda n} = \frac{R}{\lambda}$, with $R$
being the longest feature vector, i.e.

$$ R := \max_{i \in [n],\, y \in \mathcal{Y}_i} \|\psi_i(y)\|_2 . $$

Altogether, we have obtained that the curvature $C_f$ is upper bounded by
$\frac{2R^2}{\lambda}$. □
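
Computing the Product Curvature Constant $C_f^{\mathrm{prod}}$ in the SVM Case

Lemma 5. For the dual structural SVM objective function (4) over the domain
$\mathcal{M} := \Delta_{|\mathcal{Y}_1|} \times \dots \times \Delta_{|\mathcal{Y}_n|}$, the
total curvature constant $C_f^{\mathrm{prod}}$ on the product domain $\mathcal{M}$, as defined
in (13), is upper bounded by

$$ C_f^{\mathrm{prod}} \le \frac{2R^2}{\lambda n} , $$

where $R$ is the maximal length of a difference feature vector, i.e.
$R := \max_{i \in [n],\, y \in \mathcal{Y}_i} \|\psi_i(y)\|_2$.

Proof. We follow the same lines as in the above proof, but now applying the same bound to the
definition (12) of the curvature on the $i$-th factor. Here, the change from $x$ to $y$ is now
restricted to only affect the coordinates in the $i$-th factor $\mathcal{M}^{(i)}$. To simplify
the notation, let $\mathcal{M}^{[i]}$ be $\mathcal{M}^{(i)}$ augmented with the zero domain for
all the other blocks, i.e. the analog of $x_{(i)} \in \mathcal{M}^{(i)}$ is
$x_{[i]} \in \mathcal{M}^{[i]}$. Here $x_{(i)}$ is the $i$-th block of $x$, whereas
$x_{[i]} \in \mathcal{M}^{[i]}$ is $x_{(i)}$ padded with zeros for all the other blocks. We thus
require that $y - x \in \mathcal{M}^{[i]}$ for a valid change from $x$ to $y$. Again by the
degree-two Taylor expansion, we obtain

$$ C_f^{(i)} \le \sup_{\substack{x, y \in \mathcal{M},\ (y - x) \in \mathcal{M}^{[i]}, \\
     z \in [x, y] \subseteq \mathcal{M}}} \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)
   = \frac{\lambda}{2} \cdot \sup_{\substack{x, y \in \mathcal{M}, \\
     (y - x) \in \mathcal{M}^{[i]}}} (A(y - x))^T A(y - x) $$
$$ = \frac{\lambda}{2} \cdot \sup_{v, w \in A\mathcal{M}^{(i)}} \|v - w\|_2^2
   \le 2\lambda \cdot \sup_{v \in A\mathcal{M}^{(i)}} \|v\|_2^2 . $$

In other words, by definition of our compact domain
$\mathcal{M}^{(i)} = \Delta_{|\mathcal{Y}_i|}$, we have that each vector
$v \in A\mathcal{M}^{(i)}$ is a convex combination of the (scaled) feature vectors
corresponding to the possible labelings for data point $i$. Therefore, the norm $\|v\|_2$ is
again upper bounded by the longest column of the matrix $A$, which means
$\|v\|_2 \le \frac{R}{\lambda n}$ with
$R := \max_{i \in [n],\, y \in \mathcal{Y}_i} \|\psi_i(y)\|_2$. Summing up over the $n$ many
factors, we obtain that the product curvature $C_f^{\mathrm{prod}}$ is upper bounded by
$n \cdot 2\lambda \cdot \frac{R^2}{\lambda^2 n^2} = \frac{2R^2}{\lambda n}$. □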



D. More Details on the Algorithms for Structural SVMs 

D.l. Equivalence of an Exact Frank-Wolfe Step and Loss-Augmented Decoding 

To see that the proposed Algorithm 3 indeed exactly corresponds to the standard Frank-Wolfe
Algorithm 1 applied to the SVM dual problem (4), we verify that the search direction $s$ giving
the update $w_s = As$ is in fact an exact Frank-Wolfe step, which can be seen as follows:

Lemma 6. The sparse vector $s \in \mathbb{R}^m$ constructed in the inner for-loop of Algorithm 3
is an exact solution to
$s = \arg\min_{s' \in \mathcal{M}} \langle s', \nabla f(\alpha^{(k)}) \rangle$ for optimization
problem (4).

Proof. Over the product domain
$\mathcal{M} = \Delta_{|\mathcal{Y}_1|} \times \dots \times \Delta_{|\mathcal{Y}_n|}$, the
minimization $\min_{s' \in \mathcal{M}} \langle s', \nabla f(\alpha) \rangle$ decomposes as
$\sum_i \min_{s_i \in \Delta_{|\mathcal{Y}_i|}} \langle s_i, \nabla_i f(\alpha) \rangle$. The
minimization of a linear function over the simplex reduces to a search over its corners. In this
case, it amounts for each $i$ to finding the minimal component of $-H_i(y; w)$ over
$y \in \mathcal{Y}_i$, i.e. solving the loss-augmented decoding problem as used in Algorithm 3
to construct the domain vertex $s$. To see this, note that for our choice of primal variables
$w = A\alpha$, the gradient of the dual objective,
$\nabla f(\alpha) = \lambda A^T A \alpha - b$, can be written as $\lambda A^T w - b$. This
vector is precisely the loss-augmented decoding function $-\frac{1}{n} H_i(y; w)$, for
$i \in [n]$, $y \in \mathcal{Y}_i$, as defined in (2). □
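The decomposition used in this proof can be sketched in a few lines. The example below is only
an illustration with made-up gradient values: it assumes each block of the gradient is available
as an explicit list (which is feasible only for small label sets; in the structural SVM the
per-block minimization is the loss-augmented decoding oracle), and it compares the block-wise
solution against exhaustive enumeration. The function names are hypothetical.

```python
from itertools import product

def simplex_vertex_argmin(grad_block):
    """Index j of the vertex e_j minimizing <s, grad_block> over the simplex."""
    return min(range(len(grad_block)), key=lambda j: grad_block[j])

def linear_min_over_product(grad_blocks):
    """Solve min_{s in M} <s, grad> block by block for M a product of simplices."""
    s = []
    for g in grad_blocks:
        j = simplex_vertex_argmin(g)
        s.append([1.0 if t == j else 0.0 for t in range(len(g))])
    return s

def brute_force_min(grad_blocks):
    """Reference value: enumerate every combination of simplex vertices."""
    index_sets = [range(len(g)) for g in grad_blocks]
    return min(sum(g[j] for g, j in zip(grad_blocks, choice))
               for choice in product(*index_sets))

grad_blocks = [[0.3, -1.2, 0.5], [2.0, 1.0]]
s = linear_min_over_product(grad_blocks)
value = sum(a * b for g, sb in zip(grad_blocks, s) for a, b in zip(g, sb))
```

Since the gradient block equals $-\frac{1}{n} H_i(\cdot\,; w)$, picking the smallest gradient
entry is the same as picking the label with the largest $H_i(y; w)$.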

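Footnote (on the choice of $R$ in the proof of Lemma 4): This choice of the radius $R$ then
gives $\frac{R}{\lambda n} = \max_{i \in [n],\, y \in \mathcal{Y}_i}
\big\| \frac{1}{\lambda n} \psi_i(y) \big\|_2
= \max_{i \in [n],\, y \in \mathcal{Y}_i} \| A_{(i,y)} \|_2$.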

D.2. Convergence Analysis

Convergence of the Batch Frank-Wolfe Algorithm 3 on the Structural SVM Dual

Theorem 2' (Batch FW SVM). Algorithm 3 obtains an $\varepsilon$-approximate solution to the
structural SVM dual problem (4) and duality gap $g(\alpha^{(k)}) \le \varepsilon$ after at most
$O\big(\frac{R^2}{\lambda \varepsilon}\big)$ iterations, where each iteration costs $n$ oracle
calls.

Proof. We apply the known convergence results for the standard Frank- Wolfe Algorithm 1, as given e.g. 
in Frank and Wolfe (1956); Dunn and Harshbargcr (1978); Jaggi (2011), or as given at the end of the proof 
of Theorem 7: For each fc > 1, the iterate a^'^^ of Algorithm 1 (either using the predefined step-sizes, or 
using line-search) satisfies E[/(a'^'°')] — /(a*) < |^ , where a.* £ Ai is an optimal solution to problem (4). 

Furthermore, if Algorithm 1 is run for K > 2 iterations, then it has an iterate a''^-', 1 < k < K , with 
duality gap bounded by E[(7(q;*^''))] < ^^^^ , as shown e.g. in Jaggi (2011), or also in our proof for the 
generalized product variant provided in Appendix E, when the number of factors is set to one. 

Now for the SVM problem and the equivalent Algorithm 3, the claim follows from the curvature bound 
Cf < ^ for the dual structural SVM objective function (4) over the domain Ai :— A|;yj| x . . . x A|;y^|, as 
given in the above Lemma 4 □ 

Convergence of the Stochastic Frank-Wolfe Algorithm 4 on the Structural SVM Dual 

Theorem 3' (Stochastic FW SVM). If $L_{\max} \le \frac{4R^2}{\lambda n}$, then Algorithm 4
obtains an $\varepsilon$-approximate solution to the structural SVM dual problem (4) and
expected duality gap $E[g(\alpha^{(k)})] \le \varepsilon$ after at most
$O\big(\frac{R^2}{\lambda \varepsilon}\big)$ iterations, where each iteration costs a single
oracle call.

If L max > then the line-search variant of Algorithm 4 will require an additional (constant in e) 

number o/ 2n log (^^^^^g^) steps to get the same error and duality gap guarantees, whereas the predefined 
step-size variant will require an additional O ( "-^^"^ ^ steps. 

Proof. Writing $h_0 = f(\alpha^{(0)}) - f(\alpha^*)$ for the error at the starting point used by
the algorithm, the convergence Theorem 1 states that if $k \ge 0$ and
$k \ge \frac{2n}{\varepsilon} \big( 2C_f^{\mathrm{prod}} + h_0 \big)$, then the expected error
is $E[f(\alpha^{(k)})] - f(\alpha^*) \le \varepsilon$, and analogously for the expected duality
gap. The result then follows by plugging in the curvature bound
$C_f^{\mathrm{prod}} \le \frac{2R^2}{\lambda n}$ for the dual structural SVM objective function
(4) over the domain
$\mathcal{M} := \Delta_{|\mathcal{Y}_1|} \times \dots \times \Delta_{|\mathcal{Y}_n|}$, as
detailed in Appendix C (notice that it is $n$ times smaller than the curvature $C_f$ needed for
the batch algorithm), and then bounding $h_0$. To bound $h_0$, we observe that by the choice of
the starting point $\alpha^{(0)}$ using only the observed labels, the initial error is bounded
as $h_0 \le g(\alpha^{(0)}) = b^T s = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in \mathcal{Y}_i}
L_i(y) \le L_{\max}$. Thus, if $L_{\max} \le \frac{4R^2}{\lambda n}$, then we have
$h_0 \le 2C_f^{\mathrm{prod}}$, which proves the first part of the theorem.

In the case Lmax > then the predefined step-size variant will require an additional ^sdis- < ^"-Lmax 
steps as we couldn't use the fact that ho < 20^°"^. For the line-search variant, on the other hand, we can 
use the improved convergence Theorem 10 given in Appendix E, which shows that the algorithm require 

ko = 2rilog ( — '-^^ ) steps to reach the condition ho < 2C?™'*; once this condition is satisfied, we can 



simply re-use Theorem 1 with k redefined as A; — fco to get the final convergence rates. We also point out 
that the statement of Theorem 10 stays valid by replacing Cy""^ with any Cy"^' > Cy""^ in it. So plugging 

in Cy""^' = ^ and the bound ho < Lmax in the ko quantity gives back the number of additional steps 
mentioned in the second part of the theorem statement. □ 

We note that the condition $L_{\max} \le \frac{4R^2}{\lambda n}$ is not necessarily too
restrictive in the case of the structural SVM setup. In particular, the typical range of
$\lambda$ which is needed for a problem is around $O(1/n)$, and so the condition becomes
$L_{\max} \le 4R^2$, which is typically satisfied when the loss function is normalized.

D.3. Implementation 

We comment on three practical implementation aspects of Algorithm 4 on large structural SVM
problems:

Memory. For each datapoint $i$, our Algorithm 4 stores an additional vector
$w_i \in \mathbb{R}^d$ holding the contribution of its corresponding dual variables $\alpha_i$
to the primal vector $w = A\alpha$, i.e. $w_i = A\alpha_{[i]}$, where $\alpha_{[i]}$ is
$\alpha_i$ padded with zeros so that $\alpha_{[i]} \in \mathbb{R}^m$ and
$\alpha = \sum_i \alpha_{[i]}$. This means the algorithm needs more memory than the direct (or
batch) Frank-Wolfe structural SVM Algorithm 3, but the additional memory is usually bounded by a
constant times the size of the input data itself. In practice, in the case that the feature
vectors $\psi_i(y)$ are sparse, we can even get the same improvement in memory requirements for
$w_i$, since the $\psi_i(y)$ usually have the same sparsity pattern for fixed $i$.

Duality Gap as a Stopping Criterion. As in the "classical Frank-Wolfe" structural SVM
Algorithm 3 explained in Section 4, we would again like to use the duality gap
$g(\alpha^{(k)}) \le \varepsilon$ as the stopping criterion for the faster Algorithm 4.
Unfortunately, since now in every step we only update a single one of the many factors, such a
single direction $s_{(i)}$ will only determine the partial gap $g^{(i)}(\alpha^{(k)})$ in the
$i$-th factor, but not the full information needed to know the total gap $g(\alpha^{(k)})$.
Instead, to compute the total gap, a single complete (batch) pass through all datapoints as in
Algorithm 3 is necessary, to obtain a full linear minimizer $s \in \mathcal{M}$. For efficiency
reasons, we therefore only compute the duality gap every, say, $Nn$ iterations for some constant
$N > 1$. Then stopping as soon as
$g(\alpha^{(k)}) = g(w^{(k)}, \ell^{(k)}, w_s, \ell_s) \le \varepsilon$ will not affect our
convergence results.
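The periodic check described above can be organized as a thin driver around the update step. The
sketch below is a hypothetical skeleton, not the paper's implementation: `step` stands for one
single-block update, `full_gap` for the batch pass that evaluates the total gap, and
`check_every` plays the role of the constant $N$. The toy "solver" at the bottom merely halves a
stored gap value so that the control flow can be followed by hand.

```python
def run_until_gap(step, full_gap, n, eps, check_every=2, max_iters=10_000):
    """Run single-block updates, evaluating the full duality gap every check_every*n steps.

    Returns (number of iterations performed, number of batch gap evaluations).
    """
    gap_evals = 0
    for k in range(1, max_iters + 1):
        step()
        if k % (check_every * n) == 0:
            gap_evals += 1              # one batch pass over all n datapoints
            if full_gap() <= eps:
                return k, gap_evals
    return max_iters, gap_evals

# Toy stand-in for a solver: each "update" halves the stored gap value.
state = {"gap": 1.0}

def toy_step():
    state["gap"] *= 0.5

def toy_full_gap():
    return state["gap"]

iters, evals = run_until_gap(toy_step, toy_full_gap, n=3, eps=0.01)
```

With $n = 3$ and `check_every=2`, the gap is inspected only at iterations 6, 12, ..., so the
batch passes cost a constant fraction of the total oracle calls.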

Line-Search. To compute the line-search step-size for the coordinate descent Frank-Wolfe on the
structural SVM, we recall that
$\gamma_{\mathrm{opt}} := \frac{\langle \alpha - s, \nabla f(\alpha) \rangle}
{\lambda \| A(\alpha - s) \|^2}$, which is valid for any $s \in \mathcal{M}$. For the coordinate
descent Frank-Wolfe, $s$ is equal to $\alpha$ for all blocks, except for the $i$-th block. This
means that $\alpha - s = \alpha_{[i]} - s_{[i]}$, i.e. it is zero everywhere except on the
$i$-th block. By recalling that $w_i = A\alpha_{[i]}$ is the individual contribution to $w$ from
$\alpha_i$ which is stored during the algorithm, we see that the denominator thus becomes
$\lambda \| A(\alpha - s) \|^2 = \lambda \| w_i - w_s \|^2$. The numerator is
$\langle \alpha - s, \nabla f(\alpha) \rangle = (\alpha - s)^T (\lambda A^T A \alpha - b)
= \lambda (w_i - w_s)^T w - \ell_i + \ell_s$, where as before $\ell_i = b^T \alpha_{[i]}$ is
maintained during Algorithm 4, and so the line-search step-size can be computed efficiently. We
mention in passing that when $s_{(i)}$ is the exact minimizer of the linear subproblem on
$\mathcal{M}^{(i)}$, then the numerator is actually a duality gap component $g^{(i)}(\alpha)$ as
defined in (15). The total duality gap is then $g(\alpha) = \sum_i g^{(i)}(\alpha)$, which can
only be computed if we do a batch pass over all the data points, as explained in the previous
paragraph.
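The step-size formula above translates directly into code. The following sketch assumes that
$w$, $w_i$, $w_s$, $\ell_i$ and $\ell_s$ are available as plain lists and floats, and it
additionally clips the result to $[0, 1]$, which is an assumption made here so that the iterate
stays in the simplex. The function name and the numbers in the example are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def line_search_step(w, w_i, w_s, ell_i, ell_s, lam):
    """Step-size gamma = numerator / denominator from the text, clipped to [0, 1].

    numerator   = lam * (w_i - w_s)^T w - ell_i + ell_s
    denominator = lam * ||w_i - w_s||^2
    """
    diff = [a - b for a, b in zip(w_i, w_s)]
    numerator = lam * dot(diff, w) - ell_i + ell_s
    denominator = lam * dot(diff, diff)
    if denominator == 0.0:
        # The objective is linear along this direction: go all the way or stay put.
        return 1.0 if numerator > 0.0 else 0.0
    return max(0.0, min(1.0, numerator / denominator))

# Made-up values for a single block update.
w, w_i, w_s = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
ell_i, ell_s, lam = 0.0, 0.5, 1.0
gamma = line_search_step(w, w_i, w_s, ell_i, ell_s, lam)
gamma_clipped = line_search_step(w, w_i, w_s, ell_i, 5.0, lam)
```

Only the stored per-example quantities enter the computation, so the cost of the line-search is
a few inner products in the feature dimension.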

D.4. More details on the Kernelized Algorithm 

Both Algorithms 3 and 4 can be used with kernels by explicitly maintaining the sparse dual
variables $\alpha^{(k)}$ instead of the primal variables $w^{(k)}$. In this case, the classifier
is only given implicitly as a sparse combination of the corresponding kernel functions, i.e.
$w = A\alpha$, where $\psi_i(y) = k(x_i, y_i; \cdot, \cdot) - k(x_i, y; \cdot, \cdot)$ for a
structured kernel $k : (\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y})
\to \mathbb{R}$. Note that the number of dual variables grows with the number of iterations,
and so the time to take dot products also grows linearly.

Algorithm 5: Kernelized Dual Coordinate Descent Frank-Wolfe for Structural SVM

Start with $\alpha^{(0)} := (e_{y_1}, \dots, e_{y_n}) \in \mathcal{M} =
    \Delta_{|\mathcal{Y}_1|} \times \dots \times \Delta_{|\mathcal{Y}_n|}$
for $k = 0 \dots K$ do
    Pick $i \in_{\text{u.a.r.}} [n]$, and let $\gamma := \frac{2n}{k + 2n}$
    Solve $y_i^* := \arg\max_{y \in \mathcal{Y}_i} H_i(y; A\alpha^{(k)})$
        (Solve the loss-augmented decoding problem (2))
    $s_{(i)} := e_{y_i^*} \in \mathcal{M}^{(i)}$ (having only a single non-zero entry)
    Optionally: Recalculate the best $\gamma$ by line-search
    Update $\alpha_{(i)}^{(k+1)} := (1 - \gamma) \alpha_{(i)}^{(k)} + \gamma s_{(i)}$
    (If line-search is used, then we also update the value $b^T \alpha$ using $b^T s_{[i]}$)
end



E. Analysis of the Block-Coordinate Frank-Wolfe Algorithm 2 

Coordinate Descent Methods. Despite their simplicity and very early appearance in the
literature, surprisingly few results were known on the convergence (and convergence rates in
particular) of coordinate descent type methods. Recently, the interest in these methods has
grown again due to their good scalability to very large scale problems as e.g. in machine
learning, and has also sparked new theoretical results such as (Nesterov, 2010).

Constrained Convex Optimization over Product Domains. We consider the general constrained
convex optimization problem

$$ \min_{x \in \mathcal{M}} f(x) \qquad (14) $$

over a Cartesian product domain
$\mathcal{M} = \mathcal{M}^{(1)} \times \dots \times \mathcal{M}^{(n)} \subseteq \mathbb{R}^m$,
where each factor $\mathcal{M}^{(i)} \subseteq \mathbb{R}^{m_i}$ is convex and compact, and
$\sum_{i=1}^{n} m_i = m$. We write $x_{(i)} \in \mathbb{R}^{m_i}$ for the $i$-th block of
coordinates of a vector $x \in \mathbb{R}^m$, and $x_{[i]}$ for the padding of $x_{(i)}$ with
zeros so that $x_{[i]} \in \mathbb{R}^m$.

Nesterov's "Huge Scale" Coordinate Descent If the objective function / is strongly smooth (i.e. 
has Lipschitz continuous partial gradients V(i)/(a;) G E™*), then the following algorithm converges^ at a 
rate of , or more precisely as shown in Nesterov (2010, Section 4): 

Algorithm 6: Uniform Coordinate Descent Method, (Nesterov, 2010, Section 4) 

Let e M 

for fc = . . . oo do 

Pick i eu.a.r. [n] 

Compute S(j) := argmin^^.jg^(i) (V(i)/(a;('')), S(j)) + ^ \\s(i) - X(^i)\\ 

Update ^~'^(*(^) ~ ^(0-^ (only affecting in the i-th coordinate block) 

end 



Using Simpler Update Steps: Frank-Wolfe / Conditional Gradient Methods. In some large-scale
applications, the above computation of the update direction $s_{(i)}$ can be problematic, e.g.
if the Lipschitz constants $L_i$ are unknown, or, more importantly, if the domains
$\mathcal{M}^{(i)}$ are such that the quadratic term makes the subproblem for $s_{(i)}$ hard to
solve.

The structural SVM is a nice example where this makes a big difference. Here, each domain factor
$\mathcal{M}^{(i)}$ is a simplex of exponentially many variables, but nevertheless the linear
subproblem over one such factor is often relatively easy to solve.

We would therefore like to replace the above computation of $s_{(i)}$ by a simpler one, as
proposed in the following algorithm variant:

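Algorithm 7: Cheaper Coordinate Descent with "Frank-Wolfe"-Type Update Steps

Let $x^{(0)} \in \mathcal{M}$
for $k = 0 \dots \infty$ do
    Pick $i \in_{\text{u.a.r.}} [n]$
    Let $\gamma := \frac{2n}{k + 2n}$
    Compute $s_{(i)} := \arg\min_{s_{(i)} \in \mathcal{M}^{(i)}}
        \langle s_{(i)}, \nabla_{(i)} f(x^{(k)}) \rangle$
        Or alternatively, find $s_{(i)}$ that solves this linear problem up to an additive
        error of $\gamma C_f^{(i)}$
    Optionally: Perform line-search for the step-size:
        $\gamma := \arg\min_{\gamma \in [0,1]} f\big( x^{(k)} + \gamma (s_{[i]} - x_{[i]}^{(k)}) \big)$
    Update $x_{(i)}^{(k+1)} := x_{(i)}^{(k)} + \gamma (s_{(i)} - x_{(i)}^{(k)})$
        (only affecting the $i$-th coordinate block)
end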



This natural coordinate descent type optimization method picks a single one of the $n$ factors
uniformly at random, and in each step leaves all other factors unchanged.

If there is only one factor ($n = 1$), then Algorithm 7 becomes the standard Frank-Wolfe (or
conditional gradient) algorithm (Frank and Wolfe, 1956), which is known to converge at a rate of
$O(1/k)$.
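A compact sketch of the line-search variant of Algorithm 7 is given below. It is an
illustration under explicit assumptions rather than the authors' implementation: the objective
is a hypothetical least-squares function $f(x) = \frac12 \| \sum_i A_i x_{(i)} - t \|^2$, each
factor is a probability simplex, and the exact line-search is available in closed form because
the objective is quadratic. All function and variable names are made up for this example.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_block(cols, x_block):
    """Return A_i x_i, where cols holds the columns of A_i."""
    return [sum(x_block[j] * cols[j][r] for j in range(len(cols)))
            for r in range(len(cols[0]))]

def block_gaps(blocks, x, residual):
    """Per-block gaps max_s <x_i - s, grad_i>, with grad_i = A_i^T residual."""
    gaps = []
    for cols, xb in zip(blocks, x):
        grad = [dot(col, residual) for col in cols]
        gaps.append(dot(xb, grad) - min(grad))
    return gaps

def block_coordinate_fw(blocks, target, iters, seed=0):
    """Frank-Wolfe steps on one uniformly random block at a time, with exact line-search."""
    rng = random.Random(seed)
    x = [[1.0] + [0.0] * (len(cols) - 1) for cols in blocks]   # start at a vertex
    residual = [-t for t in target]                             # sum_i A_i x_i - target
    for cols, xb in zip(blocks, x):
        residual = [a + b for a, b in zip(residual, apply_block(cols, xb))]
    for _ in range(iters):
        i = rng.randrange(len(blocks))
        cols = blocks[i]
        grad = [dot(col, residual) for col in cols]
        j = min(range(len(cols)), key=lambda t: grad[t])        # linear minimizer: vertex e_j
        image = apply_block(cols, x[i])
        direction = [cols[j][r] - image[r] for r in range(len(image))]  # A_i (e_j - x_i)
        gap_i = -dot(direction, residual)                       # block duality gap
        curv = dot(direction, direction)
        if gap_i <= 0.0 or curv == 0.0:
            continue                                            # block already optimal
        gamma = min(1.0, gap_i / curv)                          # exact line-search on [0, 1]
        x[i] = [(1.0 - gamma) * v for v in x[i]]
        x[i][j] += gamma
        residual = [a + gamma * d for a, d in zip(residual, direction)]
    return x, residual

blocks = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
target = [1.0, 0.5]
x, residual = block_coordinate_fw(blocks, target, iters=20)
objective = 0.5 * dot(residual, residual)
total_gap = sum(block_gaps(blocks, x, residual))
```

On this two-block instance a single line-search step already reaches the optimum, and the sum of
the block gaps returned by `block_gaps` drops to zero, which serves as the optimality
certificate.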

Footnote (on the convergence rate of Algorithm 6): By additionally assuming strong convexity of
$f$ w.r.t. the $\ell_1$-norm (global on $\mathcal{M}$, not only on the individual factors), one
can even obtain linear convergence rates, see again Nesterov (2010) and the follow-up paper
Richtárik and Takáč (2011).

Footnote (on the $O(1/k)$ rate of Frank-Wolfe): This convergence analysis is usually done using
the fact that in each step, the primal error is reduced by a constant times the squared duality
gap, see e.g. Frank and Wolfe (1956); Jaggi (2011).

Related Work. In contrast to the randomized choice of coordinate which we use here, the analysis
of cyclic coordinate descent algorithms (going through the factors sequentially) seems to be
notoriously difficult, such that until today, no analysis proving a convergence rate is known.
For product domains, such a cyclic analogue of our Algorithm 7 has already been proposed in
Patriksson (1998), using a generalization of Frank-Wolfe iterations under the name "cost
approximation". The analysis of Patriksson (1998) shows asymptotic convergence, but since the
method goes through the factors sequentially, no convergence rates could be proven so far.

E.l. Convergence Analysis 

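Decomposition of the Duality Gap. The product structure of our domain has a crucial effect on
the duality gap, namely that it decomposes into a sum over the $n$ components of the domain.
The Fenchel-Legendre duality gap (see Jaggi (2011, Section 2.2)) for any constrained convex
problem of the above form (14), for a fixed feasible point $x \in \mathcal{M}$, is given by

$$ g(x) := g(x; \nabla f(x)) = \max_{s \in \mathcal{M}} \langle x - s, \nabla f(x) \rangle
   = \sum_{i=1}^{n} \max_{s_{(i)} \in \mathcal{M}^{(i)}}
     \langle x_{(i)} - s_{(i)}, \nabla_{(i)} f(x) \rangle
   =: \sum_{i=1}^{n} g^{(i)}(x) . \qquad (15) $$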
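Curvature. Also, the curvature can now be defined on the individual factors,

$$ C_f^{(i)} := \sup_{\substack{x \in \mathcal{M},\ s_{(i)} \in \mathcal{M}^{(i)},\
   \gamma \in [0,1], \\ y = x + \gamma (s_{[i]} - x_{[i]})}}
   \frac{1}{\gamma^2} \big( f(y) - f(x)
   - \langle y_{(i)} - x_{(i)}, \nabla_{(i)} f(x) \rangle \big) . \qquad (16) $$

We define the global product curvature as the sum of these curvatures for each factor, i.e.

$$ C_f^{\mathrm{prod}} := \sum_{i=1}^{n} C_f^{(i)} . $$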

Approximate Subproblems. To simplify the distinction between the two variants of Algorithm 7,
i.e. the exact versus the approximate solution of the linear subproblems, we use the notation
$s = \text{LinearMin}(c, \mathcal{M}^{(i)}) := \arg\min_{s \in \mathcal{M}^{(i)}}
\langle c, s \rangle$, and $s = \text{ApproxLinearMin}(c, \mathcal{M}^{(i)}, \varepsilon')$ if
$s \in \mathcal{M}^{(i)}$ is an approximate minimizer of bounded additive error, i.e.
$\langle c, s \rangle \le \min_{s' \in \mathcal{M}^{(i)}} \langle c, s' \rangle + \varepsilon'$.

E.2. Primal Convergence on Product Domains 

The following theorem shows that after $O(1/\varepsilon)$ many iterations, Algorithm 7 obtains
an $\varepsilon$-approximate solution.

Theorem 7 (Primal Convergence). For each $k \ge 0$, the iterate $x^{(k)}$ of the exact variant
of Algorithm 7 (either using the predefined step-sizes, or using line-search) satisfies

$$ E[f(x^{(k)})] - f(x^*) \le \frac{2n}{k + 2n}
   \big( 2C_f^{\mathrm{prod}} + f(x^{(0)}) - f(x^*) \big) , $$

where $x^* \in \mathcal{M}$ is an optimal solution to problem (14). For the approximate variant
of Algorithm 7, it holds that

$$ E[f(x^{(k)})] - f(x^*) \le \frac{2n}{k + 2n}
   \big( 4C_f^{\mathrm{prod}} + f(x^{(0)}) - f(x^*) \big) . $$

(In other words, both algorithm variants deliver a solution of (expected) primal error at most
$\varepsilon$ after $O(1/\varepsilon)$ many iterations.)

The proof of the above theorem on the convergence-rate of the primal error crucially depends on the 
following Lemma 8 on the improvement in each iteration. 

Lemma 8. Let $\gamma \in [0, 1]$ be an arbitrary fixed step-size. Moving only within the $i$-th
factor of the domain, we consider two variants of steps towards a direction
$s_{(i)} \in \mathcal{M}^{(i)}$: Let $x_{\gamma}^{(k+1)} := x(\gamma)$ be the point obtained by
moving towards $s_{(i)}$ using step-size $\gamma$, and let
$x_{\mathrm{LS}}^{(k+1)} := x(\gamma_{\mathrm{LS}})$ be the corresponding point obtained by
line-search, i.e. $\gamma_{\mathrm{LS}} := \arg\min_{\bar\gamma \in [0,1]} f(x(\bar\gamma))$.
Here for convenience we have used the notation
$x(\bar\gamma) := x^{(k)} + \bar\gamma (s_{[i]} - x_{[i]}^{(k)})$ for all
$\bar\gamma \in [0, 1]$.

If the direction $s_{(i)}$ is given by
$s_{(i)} := \text{LinearMin}(\nabla_{(i)} f(x^{(k)}), \mathcal{M}^{(i)})$, then in expectation
over the random choice of the factor $i$ and conditional on the value of $x^{(k)}$, it holds
that

$$ E[f(x_{\mathrm{LS}}^{(k+1)})] \le E[f(x_{\gamma}^{(k+1)})]
   \le f(x^{(k)}) - \frac{\gamma}{n} g(x^{(k)}) + \frac{\gamma^2}{n} C_f^{\mathrm{prod}} . $$

If the approximate primitive
$s_{(i)} := \text{ApproxLinearMin}(\nabla_{(i)} f(x^{(k)}), \mathcal{M}^{(i)}, \gamma C_f^{(i)})$
is used instead, then the above inequality holds when the constant $C_f^{\mathrm{prod}}$ is
replaced by $2C_f^{\mathrm{prod}}$.

Proof. We write $x := x^{(k)}$, $y := x_\gamma^{(k+1)} = x + \gamma (s_{[i]} - x_{[i]})$, with $x_{[i]}$ and $s_{[i]}$ being zero everywhere except in their $i$-th block. We also write $d_x := \nabla_{(i)} f(x)$ to simplify the notation, and first prove the approximate case of the lemma. We use the definition (16) of the curvature constant $C_f^{(i)}$ of our convex function $f$ over the factor $\mathcal{M}^{(i)}$ to obtain

$$f(y) = f\big(x + \gamma (s_{[i]} - x_{[i]})\big) \le f(x) + \gamma \langle s_{(i)} - x_{(i)}, d_x \rangle + \gamma^2 C_f^{(i)} .$$

Now we use that the choice of $s_{(i)} := \mathrm{ApproxLinearMin}(d_x, \mathcal{M}^{(i)}, \varepsilon')$ is a good descent direction for the linear approximation to $f$ at $x$, on the $i$-th factor $\mathcal{M}^{(i)}$. Formally, we are given a point $s_{(i)}$ that satisfies $\langle s_{(i)}, d_x \rangle \le \min_{y_{(i)} \in \mathcal{M}^{(i)}} \langle y_{(i)}, d_x \rangle + \varepsilon'$, or in other words we have

$$\langle s_{(i)} - x_{(i)}, d_x \rangle \le \min_{y_{(i)} \in \mathcal{M}^{(i)}} \langle y_{(i)}, d_x \rangle - \langle x_{(i)}, d_x \rangle + \varepsilon' = -g^{(i)}(x) + \varepsilon' ,$$

from the definition (15) of the duality gap. Altogether, we obtain

$$f(y) \le f(x) + \gamma \big(-g^{(i)}(x) + \varepsilon'\big) + \gamma^2 C_f^{(i)} = f(x) - \gamma\, g^{(i)}(x) + 2\gamma^2 C_f^{(i)} ,$$

the last equality following by our choice of $\varepsilon' = \gamma C_f^{(i)}$. Using that the line-search by definition must lead to an objective value at least as good as the one at the fixed $\gamma$, we therefore have shown the inequality

$$f(x_{\mathrm{LS}}^{(k+1)}) \le f(x_\gamma^{(k+1)}) \le f(x^{(k)}) - \gamma\, g^{(i)}(x^{(k)}) + 2\gamma^2 C_f^{(i)} .$$

Finally, the claimed bound on the expected improvement directly follows by taking the expectation: With respect to the (uniformly) random choice of the factor $i$, the expected value of the gap $g^{(i)}(x^{(k)})$ corresponding to the picked $i$ is exactly $\frac{1}{n} g(x^{(k)})$. Also, the expected curvature of the $i$-th factor is $\frac{1}{n} C_f^{\mathrm{prod}}$.

This proves the lemma for the approximate case. The analogous claims for the exact linear primitive LinearMin() follow by the same proof for $\varepsilon' = 0$. □

Having Lemma 8 at hand, we will now prove our above primal convergence Theorem 7 using similar ideas as for general domains; see Jaggi (2011, Section 2.3) and Clarkson (2010, Theorem 2.3).

Proof of Theorem 7. From the above Lemma 8, we know that for every inner step of the exact variant of Algorithm 7 and conditioned on the value of $x^{(k)}$, we have that $\mathbb{E}[f(x_\gamma^{(k+1)})] \le f(x^{(k)}) - \frac{\gamma}{n} g(x^{(k)}) + \gamma^2 \frac{1}{n} C_f^{\mathrm{prod}}$, where the expectation is over the random choice of the factor $i$ (this holds regardless of whether line-search is used or not). Writing $h(x) := f(x) - f(x^*)$ for the (unknown) primal error at any point $x$, this reads as

$$\begin{aligned}
\mathbb{E}\big[h(x_\gamma^{(k+1)})\big] &\le h(x^{(k)}) - \tfrac{\gamma}{n}\, g(x^{(k)}) + \gamma^2 \tfrac{1}{n} C_f^{\mathrm{prod}} \\
&\le h(x^{(k)}) - \tfrac{\gamma}{n}\, h(x^{(k)}) + \gamma^2 \tfrac{1}{n} C_f^{\mathrm{prod}} \qquad (18) \\
&= \big(1 - \tfrac{\gamma}{n}\big)\, h(x^{(k)}) + \gamma^2 \tfrac{1}{n} C_f^{\mathrm{prod}} ,
\end{aligned}$$

where in the second line, we have used weak duality $h(x) \le g(x)$ as explained e.g. in Jaggi (2011, Inequality (2.6)). The inequality (18) is conditional on the value of $x^{(k)}$, which is a random quantity given the previous random choices of factors to update. We get a deterministic inequality by taking the expectation of both sides with respect to the random choice of previous factors, yielding:

$$\mathbb{E}\big[h(x_\gamma^{(k+1)})\big] \le \big(1 - \tfrac{\gamma}{n}\big)\, \mathbb{E}\big[h(x^{(k)})\big] + \gamma^2 \tfrac{1}{n} C_f^{\mathrm{prod}} . \qquad (19)$$

We observe that the resulting inequality (19) is of the same form as the one appearing in the standard Frank-Wolfe primal convergence proof such as in Jaggi (2011, Theorem 2.3), though with a crucial difference of the $1/n$ factor (and that we are now working with the expected values $\mathbb{E}[h(x^{(k)})]$ instead of the original $h(x^{(k)})$). We will thus follow a similar induction argument over $k$, but we will see that the $1/n$ factor will yield a slightly different induction base case (which for $n = 1$ can be analyzed separately to obtain a better bound). To simplify the notation, let $h_k := \mathbb{E}[h(x^{(k)})]$.

By induction, we are now going to prove that

$$h_k \le \frac{2nC}{k+2n} \quad \text{for } k \ge 0 ,$$

for the choice of constant $C := 2C_f^{\mathrm{prod}} + h_0$.

The base case $k = 0$ follows immediately from the definition of $C$, given that $C \ge h_0$.

Now we consider the induction step for $k \ge 0$. Here the bound (19) for the particular choice of step-size $\gamma_k := \frac{2n}{k+2n} \in [0,1]$ given by Algorithm 7 gives us (and also for the line-search variant, given that its bound is valid for any $\gamma$):

$$h_{k+1} \le \Big(1 - \frac{2}{k+2n}\Big) h_k + \Big(\frac{2n}{k+2n}\Big)^2 \frac{C}{2n} \le \Big(1 - \frac{2}{k+2n}\Big) \frac{2nC}{k+2n} + \Big(\frac{1}{k+2n}\Big)^2 2nC ,$$

where we have used that $C_f^{\mathrm{prod}} \le C/2$, and in the last inequality we have plugged in the induction hypothesis for $h_k$. Simply rearranging the terms gives

$$h_{k+1} \le \frac{2nC}{k+2n}\Big(1 - \frac{2}{k+2n} + \frac{1}{k+2n}\Big) = \frac{2nC}{k+2n} \cdot \frac{k+2n-1}{k+2n} \le \frac{2nC}{k+2n} \cdot \frac{k+2n}{k+2n+1} = \frac{2nC}{k+1+2n} ,$$

which is our claimed bound at $k+1$, for any $k \ge 0$.

The analogous claim for Algorithm 7 using the approximate linear primitive ApproxLinearMin() follows from exactly the same argument, by replacing every occurrence of $C_f^{\mathrm{prod}}$ in the proof here by $2C_f^{\mathrm{prod}}$ instead (compare to Lemma 8 also). □
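The induction can be checked numerically with a small sketch. It iterates the recursion (19) with equality (the worst case allowed by the inequality) for the step-size $2n/(k+2n)$ and compares the result with the bound $\frac{2n}{k+2n}(2C_f^{\mathrm{prod}} + h_0)$. The function names and the sample values of $n$, $C_f^{\mathrm{prod}}$ and $h_0$ are illustrative assumptions.

```python
from typing import List


def worst_case_errors(n: int, c_prod: float, h0: float, num_iters: int) -> List[float]:
    """Iterate recursion (19) with equality, using gamma_k = 2n / (k + 2n)."""
    h = [h0]
    for k in range(num_iters):
        gamma = 2.0 * n / (k + 2.0 * n)
        h.append((1.0 - gamma / n) * h[-1] + gamma ** 2 * c_prod / n)
    return h


def theorem7_bound(n: int, c_prod: float, h0: float, k: int) -> float:
    """Right-hand side of the primal bound: 2n/(k+2n) * (2*C_prod + h0)."""
    return 2.0 * n / (k + 2.0 * n) * (2.0 * c_prod + h0)


# Illustrative values: n = 5 factors, C_prod = 2, initial error h0 = 10.
errors = worst_case_errors(5, 2.0, 10.0, 1000)
bounds = [theorem7_bound(5, 2.0, 10.0, k) for k in range(1001)]
```

Since the recursion is driven at equality, `errors[k]` is the largest expected error compatible with (19), and it stays below `bounds[k]` for every `k`, as the induction predicts.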



Domains Without Product Structure. Our above convergence result also holds for the case of the standard Frank-Wolfe algorithm, when no product structure on the domain is assumed, i.e. for the case $n = 1$. In this case, the constant in the convergence can even be improved, since the additive term given by $h_0$, i.e. the error at the starting point, will disappear. This is because already after the first step, we obtain a bound for $h_1$ which is independent of $h_0$. More precisely, plugging $\gamma^{(0)} := 1$ in the bound (19) when $n = 1$ gives $h_1 \le 0 + C_f^{\mathrm{prod}} \le C$, now with the smaller constant $C := 2C_f^{\mathrm{prod}}$. Using $k = 1$ as the base case for the same induction proof as above, we obtain that for $n = 1$:

$$h_k \le \frac{2}{k+2}\, 2C_f^{\mathrm{prod}} \quad \text{for all } k \ge 1 ,$$

which matches the convergence rate given in Jaggi (2011, Theorem 2.3). Note that in the traditional Frank-Wolfe setting, i.e. $n = 1$, our defined curvature constant becomes $C_f^{\mathrm{prod}} = C_f$.

Dependence on $h_0$. We note that the only use of including $h_0$ in the constant $C = 2C_f^{\mathrm{prod}} + h_0$ was to satisfy the base case in the induction proof, at $k = 0$. If from the structure of the problem we can get a guarantee that $h_0 \le 2C_f^{\mathrm{prod}}$, then the smaller constant $C' = 2C_f^{\mathrm{prod}}$ will satisfy the base case and the whole proof will go through with it, without needing the extra $h_0$ term. See also Theorem 10 for a better convergence result in the case where the line-search is used.

E.3. Obtaining Small Duality Gap

The following theorem shows that after $O(1/\varepsilon)$ many iterations, Algorithm 7 will have visited a solution with $\varepsilon$-small duality gap in expectation:

Theorem 9 (Primal-Dual Convergence). For each $K \ge 2$, the exact variant of Algorithm 7 (either using the predefined step-sizes, or using line-search) will yield at least one iterate $x^{(\hat k)}$ with $1 \le \hat k \le K$ with expected duality gap bounded by

$$\mathbb{E}\big[g(x^{(\hat k)})\big] \le \beta\, \frac{2n}{K} \left(2C_f^{\mathrm{prod}} + f(x^{(0)}) - f(x^*)\right) ,$$

where $\beta = 3$.

The same statement holds for the approximate variant of Algorithm 7, when $2C_f^{\mathrm{prod}}$ is replaced by $4C_f^{\mathrm{prod}}$.

Proof. Let $K \ge 2$ be fixed. We will actually prove that the iterate of small duality gap will appear in the last half of the $K$ iterations. To simplify notation, we will denote the expected primal error and the expected duality gap for any iteration $k \ge 0$ in the algorithm by $h^{(k)} := \mathbb{E}[h(x^{(k)})]$ and $g^{(k)} := \mathbb{E}[g(x^{(k)})]$.

By our previous primal convergence Theorem 7, we already know that the primal error satisfies $h^{(k)} = \mathbb{E}[f(x^{(k)})] - f(x^*) \le \frac{C}{k+2n}$ in any iteration $k$, where $C := 2n(2C_f^{\mathrm{prod}} + h^{(0)})$.

Between the iteration $k_{\min}$ and $K$, we will now suppose that $g^{(k)}$ always stays larger than $\frac{\beta_K C}{k+2n}$, where $2\beta_K - 1 := \frac{K+1+2n}{K+1-k_{\min}}$. We will derive a contradiction for this assumption and then simplify the expression for better interpretability of the bound. The argument below will make clear why the quantities were defined as such. Formally, we assume that

$$g^{(k)} > \frac{\beta_K C}{k+2n} \quad \text{for all } k \in K_{\mathrm{set}} := \{k_{\min}, \ldots, K\} .$$

For now, $k_{\min} \le K$ is arbitrary, but we will see later that the tightest bound can be obtained by using $k_{\min} = \lceil \mu K \rceil$ with $\mu = 2 - \sqrt{2} \approx 0.5858$, though using $\mu = 1/2$ gives a simpler bound.

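Now employing the crucial expected improvement bound from Lemma 8 for the choice of $\gamma := \frac{2n}{k+2n}$, we have $h^{(k+1)} \le h^{(k)} - \frac{\gamma}{n} g^{(k)} + \gamma^2 \frac{1}{n} C_f^{\mathrm{prod}}$. It is important to observe that this bound holds both for line-search as well as for the predefined step-size $\gamma$. Since $4n C_f^{\mathrm{prod}} \le C$, this gives

$$h^{(k+1)} \le h^{(k)} - \frac{2}{k+2n}\, g^{(k)} + \frac{C}{(k+2n)^2} .$$

Plugging in our assumption that the duality gap is still "large" for $k \in K_{\mathrm{set}}$, we obtain

$$h^{(k+1)} < h^{(k)} - \frac{2}{k+2n} \cdot \frac{\beta_K C}{k+2n} + \frac{C}{(k+2n)^2} = h^{(k)} - \frac{(2\beta_K - 1)\, C}{(k+2n)^2} .$$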

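The crux of the proof is now to sum up this inequality over $K_{\mathrm{set}}$, i.e. from $k = k_{\min}$ up to $k = K$, and show that the primal error would become negative, yielding a contradiction. From the telescoping sum, we get

$$h^{(K+1)} < h^{(k_{\min})} - (2\beta_K - 1)\, C \sum_{k=k_{\min}}^{K} \frac{1}{(k+2n)^2} \le \frac{C}{k_{\min}+2n} - (2\beta_K - 1)\, C \sum_{k=k_{\min}}^{K} \frac{1}{(k+2n)^2} ,$$

where in the last inequality we have just used the primal convergence Theorem 7 giving $h^{(k_{\min})} \le \frac{C}{k_{\min}+2n}$. We can lower bound the sum by using the fact that for any positive decreasing integrable function $f$, $\sum_{k=k_{\min}}^{K} f(k) \ge \int_{k_{\min}}^{K+1} f(t)\, dt$. So, using $f(k) := \frac{1}{(k+2n)^2}$, we have that

$$\sum_{k=k_{\min}}^{K} \frac{1}{(k+2n)^2} \ge \int_{k_{\min}}^{K+1} \frac{dt}{(t+2n)^2} = \left[-\frac{1}{t+2n}\right]_{t=k_{\min}}^{t=K+1} = -\frac{1}{K+1+2n} + \frac{1}{k_{\min}+2n} = \frac{K+1-k_{\min}}{(k_{\min}+2n)(K+1+2n)} .$$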

Lower bounding the sum in the relevant inequality (and keeping in mind that $(2\beta_K - 1) \ge 0$, so that we are bounding it in the right direction), we get

$$h^{(K+1)} < \frac{C}{k_{\min}+2n} - (2\beta_K - 1)\, C\, \frac{K+1-k_{\min}}{(k_{\min}+2n)(K+1+2n)} = \frac{C}{k_{\min}+2n} \left(1 - (2\beta_K - 1)\, \frac{K+1-k_{\min}}{K+1+2n}\right) = 0 ,$$

where the last equality arises by substituting the value for $\beta_K$ that we had chosen before. We thus get $h^{(K+1)} < 0$, i.e. that the primal error becomes negative, which yields a contradiction. Our assumption on the gap is refuted, and thus we get that there exists a $\hat k \in K_{\mathrm{set}}$ such that

$$g^{(\hat k)} \le \frac{\beta_K C}{\hat k + 2n} \le \frac{\beta_K C}{k_{\min} + 2n} ,$$

where the last inequality uses the fact that $\hat k \ge k_{\min}$.
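The two ingredients of the contradiction argument, the sum-versus-integral bound and the choice of $\beta_K$ that makes the right-hand side vanish, can be sanity-checked numerically. The sketch below uses hypothetical helper names and the choice $k_{\min} = \lceil K/2 \rceil$ purely for illustration.

```python
import math


def beta_K(K: int, n: int, k_min: int) -> float:
    """beta_K defined through 2*beta_K - 1 = (K + 1 + 2n) / (K + 1 - k_min)."""
    return 0.5 * (1.0 + (K + 1.0 + 2.0 * n) / (K + 1.0 - k_min))


def tail_sum(K: int, n: int, k_min: int) -> float:
    """Sum of 1 / (k + 2n)^2 for k = k_min, ..., K."""
    return sum(1.0 / (k + 2.0 * n) ** 2 for k in range(k_min, K + 1))


def tail_integral(K: int, n: int, k_min: int) -> float:
    """Closed form of the integral of 1 / (t + 2n)^2 from k_min to K + 1."""
    return (K + 1.0 - k_min) / ((k_min + 2.0 * n) * (K + 1.0 + 2.0 * n))


def bracket(K: int, n: int, k_min: int) -> float:
    """The factor 1 - (2*beta_K - 1) * (K + 1 - k_min) / (K + 1 + 2n)."""
    return 1.0 - (2.0 * beta_K(K, n, k_min) - 1.0) * (K + 1.0 - k_min) / (K + 1.0 + 2.0 * n)


K, n = 20, 3
k_min = math.ceil(K / 2)
gap_between_sum_and_integral = tail_sum(K, n, k_min) - tail_integral(K, n, k_min)
```

For any admissible `K`, `n` and `k_min`, `bracket` evaluates to zero up to rounding, which is exactly why this particular $\beta_K$ was chosen.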

The rest of this proof amounts to transforming this bound to make it more interpretable and choosing an appropriate $\mu$ for the definition of $k_{\min} = \lceil \mu K \rceil$. By using the fact that $\mu K \le \lceil \mu K \rceil < \mu K + 1$, and substituting the value of $\beta_K$ from its definition, we have

$$\frac{\beta_K}{k_{\min}+2n} \le \frac{1}{2}\left(1 + \frac{K+1+2n}{K - (\mu K + 1) + 1}\right) \frac{1}{\mu K + 2n} = \frac{1}{K} \cdot \frac{1}{2(1-\mu)} \cdot \frac{(1-\mu)K + K + 1 + 2n}{\mu K + 2n} =: \frac{1}{K}\, \beta(\mu, K, n) .$$

Our goal is now to find an upper bound for $\beta(\mu, K, n)$ as a function of $\mu$, and find the $\mu^* \in [0,1]$ which minimizes it. First, let's write $K = \alpha n$ (which is always possible for some $\alpha > 0$). We get

$$\beta(\mu, \alpha n, n) = \frac{1}{2(1-\mu)} \cdot \frac{(2-\mu)\alpha + 2 + \frac{1}{n}}{\mu\alpha + 2} .$$

Note that since $\mu \in [0,1]$, this is an increasing function of $\alpha$, and so we can upper bound it by letting $\alpha \to \infty$, thus getting

$$\beta(\mu, K, n) \le \lim_{\alpha \to \infty} \beta(\mu, \alpha n, n) = \frac{2-\mu}{2\mu(1-\mu)} =: \beta_\mu ,$$

where the bound holds for any positive $K$ and $n$. Minimizing $\beta_\mu$ with respect to $\mu \in [0,1]$ yields the optimal $\mu^* = 2 - \sqrt{2} \approx 0.5858$, which yields $\beta_{\mu^*} \approx 2.9142$. Using $\mu = 1/2$ yields the simpler $\beta_{1/2} = 3$, which is the one we used in the statement of our theorem, and this completes the proof. Note that a more precise statement could state that, for any $\mu \in [0,1]$, there exists a $\hat k$ in $\{\lceil \mu K \rceil, \ldots, K\}$ such that

$$g^{(\hat k)} \le \frac{\beta_\mu C}{K} ,$$

with $\beta_\mu$ defined as above.

Finally, the proof for the approximate variant of Algorithm 7 follows exactly the same scheme but using $C := 2n(4C_f^{\mathrm{prod}} + h^{(0)})$ and the relevant primal convergence statement. □
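The constants appearing at the end of this proof are easy to reproduce. The sketch below evaluates $\beta_\mu = \frac{2-\mu}{2\mu(1-\mu)}$, locates its minimizer on a grid (the grid resolution is an arbitrary illustrative choice), and evaluates $\beta(\mu, K, n)$ as reconstructed from the chain of inequalities above; the function names are hypothetical.

```python
import math


def beta_mu(mu: float) -> float:
    """Limit constant beta_mu = (2 - mu) / (2 * mu * (1 - mu)), for 0 < mu < 1."""
    return (2.0 - mu) / (2.0 * mu * (1.0 - mu))


def beta(mu: float, K: int, n: int) -> float:
    """beta(mu, K, n) = ((2 - mu) K + 1 + 2n) / (2 (1 - mu) (mu K + 2n))."""
    return ((2.0 - mu) * K + 1.0 + 2.0 * n) / (2.0 * (1.0 - mu) * (mu * K + 2.0 * n))


# Grid search for the minimizer of beta_mu on the open interval (0, 1).
grid = [i / 10000.0 for i in range(1, 10000)]
mu_star = min(grid, key=beta_mu)

beta_half = beta_mu(0.5)                     # constant used in the theorem
beta_best = beta_mu(2.0 - math.sqrt(2.0))    # slightly smaller optimal constant
```

Setting the derivative of $\beta_\mu$ to zero gives $\mu^2 - 4\mu + 2 = 0$, whose only root in $(0,1)$ is $2 - \sqrt{2}$; the grid search is just a numerical cross-check of that value.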

E.4. An Improved Primal Convergence for the Line-Search Case 

If line-search is used, we can improve the convergence result of Theorem 7 by showing a weaker dependence 
on the starting condition thanks to faster progress in the starting phase of the first few iterations: 

Theorem 10 (Improved Primal Convergence for Line-Search). For each $k \ge k_0$, the iterate $x^{(k)}$ of the line-search variant of Algorithm 7 (using the exact linear subproblem) satisfies

$$\mathbb{E}\big[f(x^{(k)})\big] - f(x^*) \le \frac{4n\, C_f^{\mathrm{prod}}}{k - k_0 + 2n} ,$$

where $k_0 := \max\left\{0, \left\lceil \log\!\left(\frac{h(x^{(0)})}{2C_f^{\mathrm{prod}}}\right) \Big/ (-\log \xi_n) \right\rceil\right\}$ is the number of steps required to guarantee that $\mathbb{E}[f(x^{(k)})] - f(x^*) \le 2C_f^{\mathrm{prod}}$, with $x^* \in \mathcal{M}$ being an optimal solution to problem (14), and $h(x^{(0)}) := f(x^{(0)}) - f(x^*)$ is the primal error at the starting point, and $\xi_n := 1 - \frac{1}{2n} < 1$ is the geometric decrease rate of the primal error in the first phase $k < k_0$.

Proof. For the line-search case, the expected improvement guaranteed by Lemma 8, in marginal expectation as in (19), is valid for any choice of $\gamma \in [0,1]$:

$$\mathbb{E}\big[h(x_{\mathrm{LS}}^{(k+1)})\big] \le \big(1 - \tfrac{\gamma}{n}\big)\, \mathbb{E}\big[h(x^{(k)})\big] + \gamma^2 \tfrac{1}{n} C_f^{\mathrm{prod}} . \qquad (20)$$

Because the bound (20) holds for any $\gamma$, we are free to choose the one which minimizes it subject to $\gamma \in [0,1]$, that is $\gamma^* := \min\left\{1, \frac{h_k}{2C_f^{\mathrm{prod}}}\right\}$, where we have again used the identification $h_k := \mathbb{E}[h(x^{(k)})]$. Now we distinguish two cases:

If $\gamma^* = 1$, then $h_k \ge 2C_f^{\mathrm{prod}}$ and so (20) at $\gamma = \gamma^*$ becomes

$$h_{k+1} \le \Big(1 - \frac{1}{n}\Big) h_k + \frac{1}{n}\, C_f^{\mathrm{prod}} \le \Big(1 - \frac{1}{2n}\Big) h_k ,$$

giving $h_k \le (\xi_n)^k h_0 \le B$ (where we have chosen $\xi_n := 1 - \frac{1}{2n}$), as soon as $k \ge \log_{1/\xi_n}(h_0/B) = \frac{\log(h_0/B)}{-\log(1 - \frac{1}{2n})}$. By employing this bound for $B = 2C_f^{\mathrm{prod}}$, we have obtained a logarithmic bound on the number of steps that fall into the first regime case here, i.e. where $h_k$ is still "large". Here it is crucial to note that the primal error $h_k$ is always decreasing in each step, due to the line-search, so once we leave this regime of $h_k \ge 2C_f^{\mathrm{prod}}$, then we will never enter it again in subsequent steps.

On the other hand, as soon as we reach a step $k$ (e.g. when $k = k_0$) such that $\gamma^* < 1$ or equivalently $h_k < 2C_f^{\mathrm{prod}}$, then we are always in the second phase where $\gamma^* = \frac{h_k}{2C_f^{\mathrm{prod}}}$. Plugging this value of $\gamma^*$ in (20) yields the recurrence bound:

$$h_{k+1} \le h_k - \frac{1}{\nu}\, h_k^2 \quad \forall k \ge k_0 , \qquad (21)$$

where $\nu := 4n C_f^{\mathrm{prod}}$, with the initial condition $h_{k_0} \le 2C_f^{\mathrm{prod}} = \frac{\nu}{2n}$. This is a standard recurrence inequality which appeared for example in Joachims et al. (2009, Theorem 5, see their Equation (23)) or in the appendix of Teo et al. (2007). We can solve the recurrence (21) by following the argument of Teo et al. (2007), where it was pointed out that since $h_k$ is monotonically decreasing, we can upper bound $h_k$ by the solution to the corresponding differential equation $h'(t) = -h^2(t)/\nu$, with initial condition $h(k_0) = h_{k_0}$. Integrating both sides, we get the solution $h(t) = \frac{\nu}{t - k_0 + \nu / h_{k_0}}$. Plugging in the value for $h_{k_0}$ and since $h_k \le h(k)$, we thus get the bound:

$$h_k \le \frac{4n\, C_f^{\mathrm{prod}}}{k - k_0 + 2n} \quad \forall k \ge k_0 , \qquad (22)$$

which completes the proof. □

Note that since $-\log\big(1 - \frac{1}{2n}\big) > \frac{1}{2n}$ for $n > 0.5$ and for the natural logarithm, we get that $k_0 \le \left\lceil 2n \log\!\left(\frac{h(x^{(0)})}{2C_f^{\mathrm{prod}}}\right) \right\rceil$, and so unless $h(x^{(0)}) \le 2C_f^{\mathrm{prod}}$, we get a linear number of steps in $n$ required to reach the second phase, but the dependence is logarithmic in $h(x^{(0)})$, instead of linear in $h(x^{(0)})$ as given by our previous convergence Theorem 7 for the fixed step-size variant (in the fixed step-size variant, we would need $k_0 = \left\lceil 2n\, \frac{h(x^{(0)})}{2C_f^{\mathrm{prod}}} \right\rceil$ steps to guarantee $h_{k_0} \le 2C_f^{\mathrm{prod}}$). Therefore, for the line-search variant of our Algorithm 7, we have obtained guaranteed $\varepsilon$-small error after

$$\left\lceil 2n \log\!\left(\frac{h(x^{(0)})}{2C_f^{\mathrm{prod}}}\right) \right\rceil + \left\lceil \frac{4n\, C_f^{\mathrm{prod}}}{\varepsilon} \right\rceil$$

iterations.
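The two-phase behaviour can be simulated with a short sketch. It drives the bound (20) at equality with the minimizing step-size $\gamma^* = \min\{1, h_k / (2C_f^{\mathrm{prod}})\}$, computes $k_0$ from its definition in Theorem 10, and compares the length of the first phase with the fixed step-size estimate. The function names and the sample values ($n = 10$, $C_f^{\mathrm{prod}} = 1$, $h_0 = 1000$) are illustrative assumptions.

```python
import math
from typing import List


def k0_line_search(n: int, c_prod: float, h0: float) -> int:
    """k0 of Theorem 10: steps until the expected error is at most 2*C_prod."""
    if h0 <= 2.0 * c_prod:
        return 0
    xi = 1.0 - 1.0 / (2.0 * n)              # geometric rate of the first phase
    return max(0, math.ceil(math.log(h0 / (2.0 * c_prod)) / (-math.log(xi))))


def k0_fixed_step(n: int, c_prod: float, h0: float) -> int:
    """Steps the fixed step-size bound needs to reach error 2*C_prod."""
    return math.ceil(2.0 * n * h0 / (2.0 * c_prod))


def worst_case_line_search(n: int, c_prod: float, h0: float, num_iters: int) -> List[float]:
    """Iterate bound (20) with equality at the minimizing step-size gamma*."""
    h = [h0]
    for _ in range(num_iters):
        gamma = min(1.0, h[-1] / (2.0 * c_prod))
        h.append((1.0 - gamma / n) * h[-1] + gamma ** 2 * c_prod / n)
    return h


n, c_prod, h0 = 10, 1.0, 1000.0
k0 = k0_line_search(n, c_prod, h0)
errors = worst_case_line_search(n, c_prod, h0, 2000)
bound_22 = [4.0 * n * c_prod / (k - k0 + 2.0 * n) for k in range(k0, 2001)]
```

With these sample values the first phase of the line-search variant is bounded by a few hundred steps at most, while the fixed step-size estimate `k0_fixed_step` is in the thousands, which illustrates the logarithmic versus linear dependence on the initial error.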

It is also interesting to point out that even though we were using the optimal step-size in the second phase of the above proof (which yielded the recurrence (21)), the second phase bound is not better than what we could have obtained by using a fixed step-size schedule of $\frac{2n}{k - k_0 + 2n}$ and following the same induction proof line as in the previous Theorem 7 (using the base case $h_{k_0} \le 2C_f^{\mathrm{prod}}$, so that we could let $C := 2C_f^{\mathrm{prod}}$). This means that the advantage of the line-search over the fixed step-size schedule only appears in knowing when to switch from a step-size of 1 (in the first phase, when $h_k \ge 2C_f^{\mathrm{prod}}$) to a step-size of $\frac{2n}{k - k_0 + 2n}$ (in the second phase), which we can't know in general unless we know the value of $f(x^*)$. In the standard Frank-Wolfe case where $n = 1$, there is no difference in the rates for line-search or fixed step-size schedule, as in this case we know $h_1 \le C_f^{\mathrm{prod}}$ as explained at the end of the proof of Theorem 7.